Abstract

We consider how to learn multi-step predictions efficiently. Conventional algorithms
wait until observing actual outcomes before performing the computations to
update their predictions. If predictions are made at a high rate or span over a
large amount of time, substantial computation can be required to store all relevant
observations and to update all predictions when the outcome is finally observed.
We show that the exact same predictions can be learned in a much more computationally
congenial way, with uniform per-step computation that does not depend
on the span of the predictions. We apply this idea to various settings of increasing
generality, repeatedly adding desired properties and each time deriving an equivalent
span-independent algorithm for the conventional algorithm that satisfies these
desiderata. Interestingly, along the way several known algorithmic constructs emerge
spontaneously from our derivations, including dutch eligibility traces, temporal difference
errors, and averaging. This allows us to link these constructs one-to-one
to the corresponding desiderata, unambiguously connecting the ‘how’ to the ‘why’.
At each step, we make sure that the derived algorithm subsumes the previous algorithms,
thereby retaining their properties. Ultimately we arrive at a single general
temporal-difference algorithm that is applicable to the full setting of reinforcement
learning.


1 Learning long-term predictions 

The span of a multi-step prediction is the number of steps elapsing between when the 
prediction is made and when its target or ideal value is known. We consider the case in 
which predictions are made repeatedly, at each of a sequence of discrete time steps. For 
example, if on each day we predict what a stock market index will be in 30 days, then the 
span is 30, whereas if we predict at each hour what the stock market index will be in 30 
days, then the span is 30 × 24 = 720.

The span may vary for individual predictions in a sequence. For example, if we predict 
on each day what the stock-market index will be at the end of the year, then the span 
will be much longer for predictions made in January than it is for predictions made in 
December. If the span may vary in this way, then we consider the span of the prediction
sequence to be the maximum possible span of any individual prediction in the sequence. 
For example, the span of a daily end-of-year stock-index prediction is 365. Often the span 
is infinite. For example, in reinforcement learning we often learn value functions that are 
predictions of the discounted sum of all future rewards in the potentially infinite future 
(Sutton and Barto 1998).

In this paper we consider computational and algorithmic issues in efficiently learning 
long-term predictions, defined as predictions of large integer span. Predictions could be 
long term in this sense either because a great deal of clock time passes, as in predicting 
something at the end of the year, or because predictions are made very often, with a short 
time between steps (e.g., as in high-frequency financial trading). The per-step computational
complexity of some algorithms for learning accurate predictions depends on the
span of the predictions, and this can become a significant concern if the span is large.
Therefore, we focus on the construction of learning algorithms whose computational complexity
per time step (in both time and memory) is constant (does not scale with time)
and independent of span.

This paper features two recurring themes. The first is the repeated spontaneous
emergence of algorithmic constructs, many of them well known, directly from our derivations. We
start each derivation by formalizing a desired property and constructing an algorithm
that fulfills it, without considering computational efficiency. Then, we derive a span-independent
algorithm that results on each step in exactly the same predictions. Interestingly,
each time a specific algorithmic construct emerges, demonstrating a clear connection
between the desideratum (the ‘why’) and the algorithmic construct (the ‘how’). For instance,
the desire to be independent of span leads to a dutch eligibility trace, which was
previously derived only in the more specific context of online temporal difference (TD)
learning (van Seijen and Sutton 2014).

The second theme is that we unify the algorithms at each step. Each time, we make 
sure to obtain an algorithm that is strictly more general than the previous ones, so that in 
the end we obtain one single algorithm that can fulfill all the desiderata while remaining 
computationally congenial. 


2 Outline of the paper 

In this section, we briefly describe the high-level narrative of the paper, without going 
into technical detail. In each of the Sections 3 to 8, we describe and formalize one or 
more desirable properties for our algorithms and then derive a computationally congenial 
algorithm that achieves this exactly. We build up to the final, most general, algorithm that 
is ultimately derived in Section 8 to highlight the connections between desired properties 
and algorithmic constructs. Making these connections clear is one of the main goals of 
this paper. 

Specifically, in Section 3 we derive a span-independent algorithm to update the predictions
for a single final outcome. The algorithm is offline in the sense that it does not change
its predictions before observing the outcome. The dutch trace emerges spontaneously,
which shows that this trace is closely tied to the requirement of span-independent computation.
This emergence is surprising and intriguing because it shows that these traces
are not specific to online TD learning, for which they were first proposed (van Seijen and
Sutton 2014).

In Section 4 we derive span-independent updates that update the predictions online
towards interim targets that temporarily stand in for the final outcome. We show that
the desire to be online results in the spontaneous emergence of TD errors (Sutton 1984;
Sutton 1988). In this paper we are mostly agnostic to the origin of the interim targets.
These may for instance be given by external experts or by our own online predictions, as in
standard TD learning (e.g., see Sutton and Barto 1998).

It can be beneficial to be able to switch smoothly between online and offline updates,
on a step-by-step basis, for instance when we do not fully trust some of the interim targets
that we would use for our online updates. This allows us to have the best of both worlds:
the online predictions stay trustworthy even if some interim targets are wrong, and we
are still able to use any useful information immediately when it is observed. In Section 5
we consider how to do this efficiently, and from our derivation an update emerges that
averages the online weights in a separate trusted weight vector. This is interesting because
such averaging is known to improve the convergence rates of online learning algorithms
(Polyak and Juditsky 1992; Bach and Moulines 2013), but seems to only rarely be used in
reinforcement learning (as noted, e.g., by Szepesvari 2010).

Some interim targets may be so informative that we want their effect to persist in the
predictions even after observing the final outcome. For instance, if the final outcome is
stochastic and the interim targets are drawn independently from the same distribution, it
makes sense to average these instead of committing fully only to the final outcome. In
the extreme, we might see an interim target that we trust so much that we do not even
care about the actual outcome anymore, for instance because the interim target already
takes into account all possible outcomes from that moment rather than only the specific
one that will happen to materialize this time, resulting in a more accurate prediction on
average than a single final outcome. In Section 6 we formalize these ideas and show they
lead naturally to a form of TD(λ) (Sutton 1988; Sutton and Barto 1998).

The λ parameter that governs the amount of persistence of the interim targets can be
interpreted as representing a degree of trust: if we trust an interim target fully (λ = 0)
we do not need to consider later observations, while if we distrust it fully (λ = 1) it will
be replaced by later targets and leave no trace in the final predictions. This is a different
notion of trust than the one considered for the smooth switching between online and offline
updates, where the trust was relative to the actual final outcome rather than the expected
outcome. These two forms of trust are compatible and complementary, and in Section 7
we show how to combine them into a single algorithm.

Up to Section 7 we have only considered predicting a single final outcome in an episodic
setting. In Section 8 we consider how to deal with two important generalizations of the
problem setting: cumulative returns, and soft terminations. Cumulative returns allow us to
see part of the return on each step, and allow us to start learning from these partial returns
immediately in the online setting. Soft terminations allow us to learn about predictions
that may conditionally terminate even if the actual process continues, and they allow for
non-episodic predictions that may terminate softly on each step rather than completely at
a single point in time. This leads to a single final algorithm that subsumes all previous
algorithms as special cases. The algorithm is similar to the conventional TD(λ) algorithm
but with important differences that ensure that it is exactly equivalent to the desired, but
inefficient, algorithm and therefore inherits all its desirable properties.

Because our final algorithm is novel, it is appropriate to analyze it. In Section 9 we
prove that the algorithm is convergent under typical mild conditions, and that it converges
to the same solution as similar previous algorithms, including TD(λ).

We conclude the paper with a short discussion in Section 10.

3 Independence of span and the emergence of traces 

We start with a supervised learning setting—predicting the final numeric outcome of an 
episodic process. An episode of the process starts at time t = 0 and moves stochastically 
from state to state generating feature vectors $\phi_t$ until termination with a final numeric
outcome Z at final time T. For example, Z could be the price of a particular stock that 
we want to predict, and each episode may be a year, such that time T corresponds to the 
end of the year. 

We consider the general case of multi-step predictions ($T \geq 1$), where a prediction is
made on each step. The standard supervised learning setting is a special case where in
each episode we only make one prediction (such that, without loss of generality, we can
take $T = 1$).

Our predictions are linear, that is, the prediction at time $t$ is the inner product of $\phi_t$
and a learned weight vector $\theta$, denoted $\phi_t^\top \theta$. The algorithms are indifferent to the origins
of the features, which may be handcrafted or learned.* The weights have an initial value
$\theta_0$ that is presumably due to previous episodes. We analyze how the weights change in a
single episode (and thus we do not include the episode number in our notation).

At the final time, when $Z$ is observed, we can update all the predictions towards the
target as in the classical least mean squares (LMS) algorithm defined by the updates:

$$\theta_{t+1} = \theta_t + \alpha_t \phi_t \left(Z - \phi_t^\top \theta_t\right), \qquad t = 0, \ldots, T-1, \tag{1}$$

where $\alpha_t > 0$ is a step-size parameter that may vary from time step to time step (e.g., as
a function of the state at that time). We call this a forward view, because to update the
prediction at time $t$ we need to look forward in time to the outcome $Z$, which is observed
at the later time $T$.
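As a minimal sketch of the forward view (1), assuming plain Python lists for vectors, the updates could be written as follows. The function names and the two-step example episode are illustrative assumptions, not part of the original text.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def lms_forward_view(theta0, features, alphas, z):
    """Apply update (1) for t = 0..T-1 once the outcome z is known.

    features[t] is phi_t and alphas[t] is alpha_t. Both must be stored
    for the whole episode, so memory grows with the span T.
    """
    theta = list(theta0)
    for phi, alpha in zip(features, alphas):
        error = z - dot(phi, theta)
        theta = [w + alpha * error * x for w, x in zip(theta, phi)]
    return theta


# Hypothetical two-step episode with two features per step.
theta_T = lms_forward_view(
    theta0=[0.0, 0.0],
    features=[[1.0, 0.0], [0.0, 1.0]],
    alphas=[0.5, 0.5],
    z=2.0,
)
```

Note that nothing can be computed until `z` is available, which is the limitation the rest of this section removes.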

To perform the updates (1) we have to wait until $Z$ is known and then do the update
for all previous time steps $t$. This requires storing and then computing updates for all the
preceding feature vectors. The required computational resources scale with the span of the
prediction (the maximum length of an episode), which is what we wish to avoid. We seek
incremental computations whose per-time-step complexity is $O(n)$, where $n$ is the number
of parameters, and that result in the same weights as (1) at the end of the episode. That
is, the incremental updates should compute the same $\theta_T$ as (1) if they are given the same
input (the same $\theta_0$, the same sequence of feature vectors $\phi_t$, and the same $Z$).

It may seem that the best we can hope for is to approximate the result computed by the
LMS algorithm, because of the strict computational restriction. Such a trade-off between
computation and accuracy is not uncommon. We will however now derive an algorithm
that finds the exact same final predictions with much more congenial computation, by
carefully analyzing the total change to the weight vector due to the LMS algorithm.

*This includes, for instance, the case where $\phi_t$ is the last hidden layer of a neural network.

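The final step of the algorithm in (1) can be rewritten as

\begin{align*}
\theta_T &= \theta_{T-1} + \alpha_{T-1}\phi_{T-1}\left(Z - \phi_{T-1}^\top\theta_{T-1}\right) \\
&= \theta_{T-1} + \alpha_{T-1}\phi_{T-1}Z - \alpha_{T-1}\phi_{T-1}\phi_{T-1}^\top\theta_{T-1} \\
&= \left(I - \alpha_{T-1}\phi_{T-1}\phi_{T-1}^\top\right)\theta_{T-1} + \alpha_{T-1}\phi_{T-1}Z \\
&= F_{T-1}\theta_{T-1} + \alpha_{T-1}\phi_{T-1}Z .
\end{align*}

Here $F_t = I - \alpha_t\phi_t\phi_t^\top$ is a fading matrix that will be important throughout this paper.

Now, continuing,

\begin{align*}
\theta_T &= F_{T-1}\left(F_{T-2}\theta_{T-2} + \alpha_{T-2}\phi_{T-2}Z\right) + \alpha_{T-1}\phi_{T-1}Z
  && \text{(expanding $\theta_{T-1}$)} \\
&= F_{T-1}F_{T-2}\theta_{T-2} + \left(F_{T-1}\alpha_{T-2}\phi_{T-2} + \alpha_{T-1}\phi_{T-1}\right)Z
  && \text{(regrouping)} \\
&= F_{T-1}F_{T-2}\left(F_{T-3}\theta_{T-3} + \alpha_{T-3}\phi_{T-3}Z\right)
   + \left(F_{T-1}\alpha_{T-2}\phi_{T-2} + \alpha_{T-1}\phi_{T-1}\right)Z
  && \text{(recursing on $\theta_{T-2}$)} \\
&= F_{T-1}F_{T-2}F_{T-3}\theta_{T-3}
   + \left(F_{T-1}F_{T-2}\alpha_{T-3}\phi_{T-3} + F_{T-1}\alpha_{T-2}\phi_{T-2} + \alpha_{T-1}\phi_{T-1}\right)Z
  && \text{(regrouping)} \\
&\;\;\vdots
  && \text{(recursing further)} \\
&= \underbrace{F_{T-1}F_{T-2}\cdots F_0\theta_0}_{=\,a_{T-1}}
   + \underbrace{\left(\sum_{t=0}^{T-1} F_{T-1}F_{T-2}\cdots F_{t+1}\alpha_t\phi_t\right)}_{=\,e_{T-1}} Z \\
&= a_{T-1} + e_{T-1}Z , \tag{2}
\end{align*}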

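where $a_t$ and $e_t$ are two auxiliary memory vectors.

Importantly, the auxiliary vectors can be updated without knowledge of $Z$, and with
complexity independent of span and proportional to the number of features. The $a_t$ vector
stores the effect of the initial weights on the updated weights. It is initialized as $a_{-1} = \theta_0$
(so that $a_0 = F_0\theta_0$, consistent with the definition of $a_{T-1}$ in (2)) and can then be
updated efficiently with

\begin{align*}
a_t &= F_tF_{t-1}\cdots F_0\theta_0 \\
&= F_t\left(F_{t-1}\cdots F_0\theta_0\right) \\
&= F_ta_{t-1} \\
&= a_{t-1} - \alpha_t\phi_t\phi_t^\top a_{t-1} \\
&= a_{t-1} + \alpha_t\phi_t\left(0 - \phi_t^\top a_{t-1}\right), \qquad t = 0, \ldots, T-1. \tag{3}
\end{align*}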
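The last two lines of (3) show why the per-step cost stays proportional to the number of features: the fading matrix never has to be formed. The sketch below, with illustrative function names that are our own assumption, compares the $O(n)$ product with the explicit $O(n^2)$ matrix product.

```python
def apply_fading(v, phi, alpha):
    """Return F v = (I - alpha * phi phi^T) v in O(n), as in (3)."""
    projection = sum(p * x for p, x in zip(phi, v))
    return [x - alpha * p * projection for x, p in zip(v, phi)]


def apply_fading_dense(v, phi, alpha):
    """Same product with the n-by-n matrix formed explicitly, in O(n^2)."""
    n = len(v)
    F = [[(1.0 if i == j else 0.0) - alpha * phi[i] * phi[j]
          for j in range(n)] for i in range(n)]
    return [sum(F[i][j] * v[j] for j in range(n)) for i in range(n)]


v = [0.5, -0.2, 0.1]
phi = [1.0, 0.5, 2.0]
fast = apply_fading(v, phi, alpha=0.1)
dense = apply_fading_dense(v, phi, alpha=0.1)
fading_diff = max(abs(x - y) for x, y in zip(fast, dense))
```

Both functions compute the same vector up to floating-point rounding; only the first has cost independent of $n^2$.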


The $e_t$ vector is analogous to the conventional eligibility trace (see Sutton 1988; Sutton
and Barto 1998 and references therein) but has a special form as first proposed by van
Seijen and Sutton (2014). It is initialized to $e_{-1} = 0$ (or, equivalently, to $e_0 = \alpha_0\phi_0$) and
then updated according to

\begin{align*}
e_t &= \sum_{k=0}^{t} F_tF_{t-1}\cdots F_{k+1}\alpha_k\phi_k \\
&= \sum_{k=0}^{t-1} F_tF_{t-1}\cdots F_{k+1}\alpha_k\phi_k + \alpha_t\phi_t \\
&= F_t\underbrace{\sum_{k=0}^{t-1} F_{t-1}F_{t-2}\cdots F_{k+1}\alpha_k\phi_k}_{=\,e_{t-1}} + \alpha_t\phi_t \\
&= F_te_{t-1} + \alpha_t\phi_t \tag{4} \\
&= e_{t-1} - \alpha_t\phi_t\phi_t^\top e_{t-1} + \alpha_t\phi_t \\
&= e_{t-1} + \alpha_t\phi_t\left(1 - \phi_t^\top e_{t-1}\right), \qquad t = 0, \ldots, T-1. \tag{5}
\end{align*}

An eligibility trace of this special form is called a dutch trace (van Hasselt, Mahmood, and
Sutton 2014). For comparison, the conventional accumulating trace that it replaces can
be written as $e_{-1} = 0$ and $e_t = e_{t-1} + \alpha_t\phi_t$.†
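To make the difference between the two traces concrete, the following sketch applies both updates to the same repeated feature vector. The single-feature setting and the step size of 0.5 are illustrative assumptions chosen so the arithmetic is easy to follow.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def accumulating_trace_step(e, phi, alpha):
    """Accumulating trace: e_t = e_{t-1} + alpha_t * phi_t."""
    return [ei + alpha * x for ei, x in zip(e, phi)]


def dutch_trace_step(e, phi, alpha):
    """Dutch trace (5): e_t = e_{t-1} + alpha_t * phi_t * (1 - phi_t^T e_{t-1})."""
    correction = 1.0 - dot(phi, e)
    return [ei + alpha * x * correction for ei, x in zip(e, phi)]


# Hypothetical episode that shows the same single feature three times.
acc, dutch = [0.0], [0.0]
for _ in range(3):
    acc = accumulating_trace_step(acc, [1.0], 0.5)
    dutch = dutch_trace_step(dutch, [1.0], 0.5)
```

In this example the accumulating trace grows linearly (0.5, 1.0, 1.5), while the dutch trace fades its own earlier contribution and approaches one (0.5, 0.75, 0.875).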

The emergence of the dutch trace here is surprising and intriguing because, in contrast
to previous work (van Seijen and Sutton 2014; van Hasselt, Mahmood, and Sutton 2014),
the dutch trace has arisen in a setting without temporal-difference (TD) learning. Eligibility
traces are not specific to TD learning at all; they are more fundamental than that.
The need for eligibility traces seems to arise whenever one tries to learn long-term predictions
in an efficient manner, that is, with computational complexity that is independent
of predictive span.

The auxiliary vectors $a_t$ and $e_t$ are updated on each time step $t < T$ and then, after
observing $Z$ at time $T$, are used to compute $\theta_T = a_{T-1} + e_{T-1}Z$, as in (2). This way we
achieve exactly the same final result as the forward view (1), but with an algorithm whose
time and memory complexity per step is uniformly $O(n)$ and independent of span. The
complete algorithm can be summarized as:


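\begin{align*}
a_{-1} &= \theta_0 , & \text{then } a_t &= a_{t-1} + \alpha_t\phi_t\left(0 - \phi_t^\top a_{t-1}\right), & t &= 0, \ldots, T-1, \\
e_{-1} &= 0 , & \text{then } e_t &= e_{t-1} + \alpha_t\phi_t\left(1 - \phi_t^\top e_{t-1}\right), & t &= 0, \ldots, T-1, \\
\theta_T &= a_{T-1} + e_{T-1}Z . \tag{6}
\end{align*}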
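The equivalence between (1) and (6) can be checked numerically. The sketch below is a minimal illustration under our own assumptions: vectors are plain Python lists, the auxiliary vector is initialized to the initial weights before the first feature vector is processed, and the example episode is made up.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def forward_view(theta0, features, alphas, z):
    """Conventional LMS updates (1), run after z is observed."""
    theta = list(theta0)
    for phi, alpha in zip(features, alphas):
        err = z - dot(phi, theta)
        theta = [w + alpha * err * x for w, x in zip(theta, phi)]
    return theta


def backward_view(theta0, stream, z_at_end):
    """Span-independent updates (6): O(n) time and memory per step."""
    a = list(theta0)             # effect of the initial weights
    e = [0.0] * len(theta0)      # dutch trace, starts at zero
    for phi, alpha in stream:    # each phi can be discarded after use
        pa, pe = dot(phi, a), dot(phi, e)
        a = [ai + alpha * x * (0.0 - pa) for ai, x in zip(a, phi)]
        e = [ei + alpha * x * (1.0 - pe) for ei, x in zip(e, phi)]
    return [ai + ei * z_at_end for ai, ei in zip(a, e)]


features = [[1.0, 0.5, 0.0], [0.2, 0.0, 1.0], [0.3, 0.3, 0.3], [1.0, 1.0, 0.0]]
alphas = [0.1, 0.2, 0.05, 0.1]
theta0 = [0.5, -0.2, 0.1]
fwd = forward_view(theta0, features, alphas, 3.0)
bwd = backward_view(theta0, zip(features, alphas), 3.0)
max_diff = max(abs(x - y) for x, y in zip(fwd, bwd))
```

The backward view consumes the features as a stream and never stores them, yet `max_diff` is zero up to floating-point rounding.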


The vector $a_{T-1}$ can be interpreted as storing the remaining effect of the initial weights
$\theta_0$ after all updates in (1) have concluded. The trace $e_{T-1}$ can be interpreted as storing
all we need to know about the feature vectors that were observed during the episode.
Together, these vectors allow us to replace all the $T$ updates of the forward view with one
fully equivalent update at the end of the episode.

We call the span-independent algorithm (6) the backward view corresponding to the
forward view defined in (1), because on each step all updates only use information that is
available at that time step: we only look backwards in time. The advantage of this is that
the updates can be computed immediately and we do not have to wait and store observations
until later. Until recently, exact equivalences between forward and backward views
were only known to exist for algorithms that update their predictions in batch (Sutton
and Barto 1998). Van Seijen & Sutton (2014) were the first to derive an online backward
view that was exactly equivalent to its forward view in terms of the learned predictions.
The derivation in this section shows that such equivalences exist more generally, including
for the LMS update in (1).

†We incorporate the step size into both trace updates. This is a slight deviation from the way these
traces are usually written, made to allow for time-changing step sizes and increased generality.

[Figure 1 plot: "Memory requirements for different spans for forward and backward view";
horizontal axis: span of one second (s), minute (m), hour (h), or day.]

Figure 1: Example memory requirements. The bars show memory requirements for
the conventional algorithm (1) and the span-independent algorithm (6). The brown bars
show the memory required by the conventional algorithm (1), which stores all observations
over the duration of a second, minute, hour, or day, when one million features of 4 bytes
each are observed every 10 ms. The blue bars on the right show the memory required by
the span-independent algorithm (6) in the same setting, which is 8 MB for any span.

When we consider the episode as a whole, there is no gain in total computation time for
the span-independent algorithm (6) compared to the conventional algorithm (1): both algorithms
use $O(nT)$ computation for the entire episode. However, in the span-independent
algorithm the computation is spread out more evenly with a uniform per-step complexity
of $O(n)$, whereas the conventional algorithm performs the bulk of computation at the end,
when we finally observe $Z$.

Additionally, there is a gain in terms of required memory. Algorithm (1) needs to store
all previously observed features, leading to memory requirements of order $O(nT)$, whereas
algorithm (6) only needs to store $a$ and $e$ and therefore has constant span-independent
memory requirements of order $O(n)$. In real-world problems, for instance in robotics,
it is not uncommon to extract millions of features from the sensory inputs at each step
(e.g., Montemerlo and Thrun 2003), where each step lasts only a fraction of a second.
Consider a robot that generates one million features each 10 ms, where each feature is a
real number represented with single precision using 4 bytes of memory. Figure 1 shows the
resulting memory requirements for the conventional algorithm and the span-independent
algorithm for different spans of the predictions. The required storage of the conventional
algorithm ranges from 0.4 GB for predictions spanning one second to about 35 Terabytes
for predictions spanning one day. While a few Gigabytes of on-board storage is feasible
with today’s resources, several Terabytes will be a significant burden for an autonomous
mobile robot. Concretely, this means that with the conventional algorithm we would have
to restrict either the number of features, the frequency at which we make predictions, or
the maximum amount of wall time a prediction can span. The span-independent algorithm
scales much better. In the example in Figure 1 it only needs to store two vectors with
$10^6$ components of 4 bytes each, resulting in memory requirements that are constant at
8 Megabytes.
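The numbers in this example follow from simple arithmetic, reproduced in the sketch below. It assumes decimal units (1 GB = $10^9$ bytes) and counts only the stored feature vectors; the constant and function names are illustrative.

```python
FEATURES = 1_000_000       # features observed per step
BYTES_PER_FEATURE = 4      # single precision
STEPS_PER_SECOND = 100     # one step every 10 ms


def forward_view_bytes(span_seconds):
    """Bytes needed to store every feature vector over the span."""
    return FEATURES * BYTES_PER_FEATURE * STEPS_PER_SECOND * span_seconds


# The backward view stores only the two vectors a and e.
backward_view_bytes = 2 * FEATURES * BYTES_PER_FEATURE

spans = {"second": 1, "minute": 60, "hour": 3600, "day": 86400}
for name, seconds in spans.items():
    print(f"{name:>6}: {forward_view_bytes(seconds) / 1e9:10.1f} GB")
print(f"backward view: {backward_view_bytes / 1e6:.0f} MB")
```

One second of features takes $4 \times 10^8$ bytes (0.4 GB) and one day takes about $3.5 \times 10^{13}$ bytes (roughly 35 TB), whereas the two auxiliary vectors take $8 \times 10^6$ bytes regardless of span.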

4 Online updating and the emergence of TD errors 

The algorithms in the previous section do not make any changes to the predictions during 
the episode; these are offline algorithms. Yet it would sometimes make sense to update 
the predictions during an episode, especially if the span is very long and we do not want 
to wait that long before we start learning. In this section we introduce an online forward 
view and derive a span-independent algorithm (the backward view) that on each time step 
computes the exact same predictions. 

An online algorithm cannot update the predictions towards the final outcome $Z$ during
the episode, because $Z$ is not yet available. Instead, if we only have observations up to a
horizon $h < T$ we may want to move the predictions for all earlier times $t < h$ towards
some informed guess of what the final outcome will be. Such a guess plays the role of
a target for the updates, like $Z$ in the forward view (1), but it is used prior to $Z$ being
available; it is an interim target. We use $Z^h$ to denote the interim target at time $h$, which
may be based on all the data available up to that horizon. The interim target might be from
a human expert or it might be, as in TD learning, the current prediction corresponding to
the feature vector $\phi_h$. Interim targets at times closer to $T$ might produce more accurate
predictions. In the example of the stock market, as we get closer to the end of the year we
may be able to more accurately estimate the final stock price. For now we consider the
general case and do not specify the source of the interim targets $Z^h$, for $h = 1, \ldots, T-1$.
Notationally it is convenient to define $Z^T = Z$. Note that the time index on $Z^h$ is a
superscript rather than a subscript. Our convention is that the superscript position is
reserved for the upper limit of the data considered available in an online update. The
subscript position is used for the time step whose prediction is being modified.

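To clarify the notation under online updating, we introduce the notation $\theta_t^h$ for the
weights at step $t$ based on all the data up through time $h$. Using these double subscripts,
what we previously called $\theta_t$ would now be $\theta_t^T$, because these weight vectors depend on $Z$,
which is considered to arrive at time $T$. The complete set of online updates is then

\begin{align*}
\theta_0^h &= \theta_0 , & h &= 0, \ldots, T; \\
\theta_{t+1}^h &= \theta_t^h + \alpha_t\phi_t\left(Z^h - \phi_t^\top\theta_t^h\right) \\
&= F_t\theta_t^h + \alpha_t\phi_tZ^h , & t &= 0, \ldots, h-1, \quad h = 1, \ldots, T. \tag{7}
\end{align*}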


This online algorithm defines a set of $h$ updates for each interim horizon $h \in \{1, \ldots, T\}$.
For each horizon, all predictions are updated towards the latest, and presumably best
available, interim target. Although most updates do not involve the final outcome $Z$, this
algorithm is still considered a forward view, because the prediction at some time $t$ (e.g.,
$\phi_t^\top\theta_t^h$) is updated using interim targets that arrive later (e.g., $Z^h$).
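We can write out all the double-subscripted weight vectors in a triangle as

$$\begin{array}{lllll}
\theta_0^0 \\
\theta_0^1 & \theta_1^1 \\
\theta_0^2 & \theta_1^2 & \theta_2^2 \\
\vdots \\
\theta_0^h & \theta_1^h & \cdots & \theta_h^h \\
\theta_0^{h+1} & \theta_1^{h+1} & \cdots & \theta_h^{h+1} & \theta_{h+1}^{h+1} \\
\vdots
\end{array} \tag{8}$$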

The computation proceeds from top to bottom, one row at a time from left to right. Each
row starts with the same initial values $\theta_0^h = \theta_0$ and then computes the sequence $\theta_1^h$, $\theta_2^h$,
and so on. If we really computed all the different weight vectors in the triangle then the
algorithm would be inefficient. With a computational complexity of $O(nh)$ on each time
$h \in \{1, \ldots, T\}$, the computation would not be constant per time step and would not be
independent of the span of the prediction; the last row alone has complexity $O(nT)$ and
it can only be computed after observing $Z$ at time $T$. However, instead of computing
the whole triangle, perhaps there is a way to incrementally compute just the diagonal, to
somehow obtain $\theta_{h+1}^{h+1}$ from $\theta_h^h$ efficiently on each step. If this can be done, then the entire
computation will be of uniform $O(n)$ complexity per time step, independent of span.
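For reference, the inefficient computation of the whole triangle can be sketched as follows. This is only an illustration of the online forward view (7) under our own assumptions: vectors are Python lists, `targets[h]` holds the interim target for horizon h, and the last entry of `targets` is the final outcome.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def triangle_row(theta0, features, alphas, z_h, h):
    """Row h of the triangle: the weights for t = 0..h, all using target z_h."""
    row = [list(theta0)]
    for t in range(h):
        theta, phi = row[-1], features[t]
        err = z_h - dot(phi, theta)
        row.append([w + alphas[t] * err * x for w, x in zip(theta, phi)])
    return row


def online_forward_view(theta0, features, alphas, targets):
    """All rows h = 0..T. Row h costs O(n h), so the cost grows with the span."""
    T = len(features)
    return [triangle_row(theta0, features, alphas, targets[h], h)
            for h in range(T + 1)]


# Hypothetical two-step episode; targets[0] is a placeholder for row 0.
rows = online_forward_view(
    theta0=[0.0, 0.0],
    features=[[1.0, 0.0], [0.0, 1.0]],
    alphas=[0.5, 0.5],
    targets=[0.0, 1.0, 2.0],
)
diagonal = [row[-1] for row in rows]
```

Every row restarts from the initial weights and needs all earlier feature vectors, which is exactly what an incremental update along `diagonal` would avoid.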

To find an efficient update along the diagonal of the triangle, notice first that the
forward view (7) already provides a way to efficiently move right one step in the triangle.
In other words, we can get to $\theta_{h+1}^{h+1}$ from $\theta_h^{h+1}$ for any $h$ with constant $O(n)$ computation.
If we can find an efficient way to step down in the triangle, that is to get to $\theta_h^{h+1}$ from $\theta_h^h$
for any $h$ with constant $O(n)$ computation, then we can combine these two steps into a
single $O(n)$ update. To see if this is possible, we first write down explicitly how each weight
vector in the triangle depends on the initial weights and the interim targets. Similar to our
derivation of the final weights in (2) in the previous section, we can unroll the forward-view
updates repeatedly starting from $\theta_t^h$ and obtaining

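\begin{align*}
\theta_t^h &= F_{t-1}\theta_{t-1}^h + \alpha_{t-1}\phi_{t-1}Z^h
  && \text{(applying (7) to $\theta_t^h$)} \\
&= F_{t-1}\left(F_{t-2}\theta_{t-2}^h + \alpha_{t-2}\phi_{t-2}Z^h\right) + \alpha_{t-1}\phi_{t-1}Z^h
  && \text{(applying (7) to $\theta_{t-1}^h$)} \\
&= F_{t-1}F_{t-2}\theta_{t-2}^h + \left(F_{t-1}\alpha_{t-2}\phi_{t-2} + \alpha_{t-1}\phi_{t-1}\right)Z^h
  && \text{(regrouping)} \\
&= F_{t-1}F_{t-2}\left(F_{t-3}\theta_{t-3}^h + \alpha_{t-3}\phi_{t-3}Z^h\right)
   + \left(F_{t-1}\alpha_{t-2}\phi_{t-2} + \alpha_{t-1}\phi_{t-1}\right)Z^h
  && \text{(applying (7) to $\theta_{t-2}^h$)} \\
&= F_{t-1}F_{t-2}F_{t-3}\theta_{t-3}^h
   + \left(F_{t-1}F_{t-2}\alpha_{t-3}\phi_{t-3} + F_{t-1}\alpha_{t-2}\phi_{t-2} + \alpha_{t-1}\phi_{t-1}\right)Z^h \\
&\;\;\vdots
  && \text{(continuing until we reach $\theta_0^h$)} \\
&= \underbrace{F_{t-1}\cdots F_0\theta_0}_{=\,a_{t-1}}
   + \underbrace{\left(\sum_{k=0}^{t-1} F_{t-1}\cdots F_{k+1}\alpha_k\phi_k\right)}_{=\,e_{t-1}} Z^h \\
&= a_{t-1} + e_{t-1}Z^h . \tag{9}
\end{align*}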
Notice that $a_{t-1}$ and $e_{t-1}$ depend only on time $t$ and not on the data horizon $h$. We can
use this result to find the difference between $\theta_h^{h+1}$ and $\theta_h^h$ as

\begin{align*}
\theta_h^{h+1} - \theta_h^h &= \left(a_{h-1} + e_{h-1}Z^{h+1}\right) - \left(a_{h-1} + e_{h-1}Z^h\right) \\
&= e_{h-1}\left(Z^{h+1} - Z^h\right) . \tag{10}
\end{align*}

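Here we see the emergence of a temporal difference error $Z^{h+1} - Z^h$. We know from the
previous section that $e_{h-1}$ can be computed incrementally with (5) and therefore does not
have to be recomputed from scratch for each new observation. Therefore, we now have an
efficient way to compute $\theta_h^{h+1}$ from $\theta_h^h$ for any $h$ with constant $O(n)$ computation per step
that is independent of span. Now that we have $\theta_h^{h+1}$ we can efficiently compute $\theta_{h+1}^{h+1}$ using
(7), and we can merge these two steps to compute $\theta_{h+1}^{h+1}$ directly from $\theta_h^h$. The complete
update can then be written as

\begin{align*}
\theta_{h+1}^{h+1} &= F_h\theta_h^{h+1} + \alpha_h\phi_hZ^{h+1}
  && \text{(using (7))} \\
&= F_h\left(\theta_h^h + e_{h-1}\left(Z^{h+1} - Z^h\right)\right) + \alpha_h\phi_hZ^{h+1}
  && \text{(using (10))} \\
&= F_h\theta_h^h + F_he_{h-1}\left(Z^{h+1} - Z^h\right) + \alpha_h\phi_hZ^{h+1} \\
&= F_h\theta_h^h + \left(e_h - \alpha_h\phi_h\right)\left(Z^{h+1} - Z^h\right) + \alpha_h\phi_hZ^{h+1}
  && \text{(using (4))} \\
&= F_h\theta_h^h + e_h\left(Z^{h+1} - Z^h\right) - \alpha_h\phi_h\left(Z^{h+1} - Z^h\right) + \alpha_h\phi_hZ^{h+1} \\
&= F_h\theta_h^h + e_h\left(Z^{h+1} - Z^h\right) + \alpha_h\phi_hZ^h \\
&= \left(I - \alpha_h\phi_h\phi_h^\top\right)\theta_h^h + e_h\left(Z^{h+1} - Z^h\right) + \alpha_h\phi_hZ^h
  && \text{(by definition of $F_h$)} \\
&= \theta_h^h + e_h\left(Z^{h+1} - Z^h\right) + \alpha_h\phi_h\left(Z^h - \phi_h^\top\theta_h^h\right) .
\end{align*}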

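This update holds for all $h \geq 1$. The update for $\theta_1^1$ is given directly by (7) as

$$\theta_1^1 = \theta_0 + \alpha_0\phi_0\left(Z^1 - \phi_0^\top\theta_0\right) .$$

For any $Z^0$ we can rewrite this as

\begin{align*}
\theta_1^1 &= \theta_0^0 + \alpha_0\phi_0\left(Z^1 - \phi_0^\top\theta_0^0\right)
  && \text{(using $\theta_0^0 = \theta_0$)} \\
&= \theta_0^0 + \alpha_0\phi_0\left(Z^1 - Z^0 + Z^0 - \phi_0^\top\theta_0^0\right) \\
&= \theta_0^0 + \alpha_0\phi_0\left(Z^1 - Z^0\right) + \alpha_0\phi_0\left(Z^0 - \phi_0^\top\theta_0^0\right) \\
&= \theta_0^0 + e_0\left(Z^1 - Z^0\right) + \alpha_0\phi_0\left(Z^0 - \phi_0^\top\theta_0^0\right) ,
  && \text{(using $e_0 = \alpha_0\phi_0$)}
\end{align*}

which means that the update derived above for $h \geq 1$ in fact also holds for $h = 0$.
For concreteness, we will define $Z^0 = 0$, even though this value does not actually affect
any of the weights.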

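Now that we have an update that can compute $\theta_{t+1}^{t+1}$ from $\theta_t^t$ for any $t$, we can drop
the redundant superscript. The resulting algorithm is

\begin{align*}
e_{-1} &= 0 , \quad \text{then } e_t = e_{t-1} + \alpha_t\phi_t\left(1 - \phi_t^\top e_{t-1}\right), & t &= 0, \ldots, T-1, \\
\theta_{t+1} &= \theta_t + e_t\left(Z^{t+1} - Z^t\right) + \alpha_t\phi_t\left(Z^t - \phi_t^\top\theta_t\right), & t &= 0, \ldots, T-1. \tag{11}
\end{align*}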
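The online backward view (11) can be checked against the diagonal of the triangle computed by the forward view (7). The sketch below is illustrative only: the list-based vectors, the function names and the three-step example episode are our own assumptions, and `targets[0]` is an arbitrary value because it cancels out of the first update.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def online_backward_view(theta0, features, alphas, targets):
    """Updates (11); returns the weights after each of the steps 0..T."""
    theta = list(theta0)
    e = [0.0] * len(theta0)
    diagonal = [list(theta)]
    for t, (phi, alpha) in enumerate(zip(features, alphas)):
        pe = dot(phi, e)
        e = [ei + alpha * x * (1.0 - pe) for ei, x in zip(e, phi)]
        td_error = targets[t + 1] - targets[t]
        err = targets[t] - dot(phi, theta)
        theta = [w + ei * td_error + alpha * x * err
                 for w, ei, x in zip(theta, e, phi)]
        diagonal.append(list(theta))
    return diagonal


def forward_diagonal(theta0, features, alphas, targets):
    """Diagonal of the triangle, recomputed from scratch for each horizon via (7)."""
    diagonal = []
    for h in range(len(features) + 1):
        theta = list(theta0)
        for t in range(h):
            err = targets[h] - dot(features[t], theta)
            theta = [w + alphas[t] * err * x for w, x in zip(theta, features[t])]
        diagonal.append(theta)
    return diagonal


features = [[1.0, 0.5], [0.2, 1.0], [0.7, 0.7]]
alphas = [0.1, 0.2, 0.3]
targets = [0.0, 1.0, 1.5, 2.0]   # interim targets; the last one is Z
theta0 = [0.3, -0.1]
bwd = online_backward_view(theta0, features, alphas, targets)
fwd = forward_diagonal(theta0, features, alphas, targets)
max_diff = max(abs(x - y) for b, f in zip(bwd, fwd) for x, y in zip(b, f))
```

The backward view touches each feature vector once, while the forward computation revisits all earlier features at every horizon; the two agree up to floating-point rounding.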


By construction this backward view is equivalent to the less efficient forward view (7) in
the sense that $\theta_t = \theta_t^t$ for all $t$. In contrast to the offline backward view (6) that we derived
in the previous section, we no longer need to compute and store the auxiliary vector $a_t$.
All relevant information that was contained therein is now stored directly in the online
weights $\theta_t$.

Although the online backward view yields different predictions during the episode, the
final weights $\theta_T$ are exactly equal to those computed by the conventional LMS algorithm
(1) that constituted our first, offline, forward view. In terms of the triangle in (8), the
online forward view (7) computes the whole triangle, the online backward view (11) efficiently
computes only the diagonal, the offline forward view (1) computes only the last
row, and the offline backward view (6) from the previous section computes only the final
weights. All four algorithms ultimately result in the same final weights.


5 Unifying online and offline learning and the emergence of averaging

The online algorithms from the previous section do not quite subsume the offline algorithms 
from Section 3. Although they all reach the same weights by the end of the episode, during
the episode their weights are different. The offline algorithm does not change the weights 
during the episode, and the online algorithm must change them. 

One might think that the online algorithm is always better because it can immediately 
use any incoming relevant information, but it is not so. Suppose the interim targets are 
always wildly wrong (say due to a poor human ‘expert’). They would cause the weights 
of the online algorithm to also be wildly wrong for all steps except the last one at the end 
of the episode. In this case the weights of the online algorithm would be worse than those 
of the offline algorithm almost all of the time. 

Because interim targets can sometimes be misleading, we might want to reduce their
effect on some steps, based on how much we trust these targets. In this section $\tilde\theta_t$ denotes
the online weights, which are computed by the span-independent backward view (11). The
unified weights $\theta_t$ take into account the degree of trust for each interim target and are
used to make our predictions. If we trust $Z^t$ fully, we want to obtain the same weights
as in the online algorithm, so that $\theta_t = \tilde\theta_t$. If we do not trust $Z^t$ at all, we want the
predictions to remain unchanged, so that $\theta_t = \theta_{t-1}$. For intermediate degrees of trust,
$\beta_t \in (0, 1)$, the algorithm should smoothly move from one extreme to the other, so the
final result of the update should be something like

$$\theta_{t+1} = (1 - \beta_{t+1})\theta_t + \beta_{t+1}\tilde\theta_{t+1} . \tag{12}$$


The above reasoning may sound plausible, but is it sound? In this section we construct
a forward view for partially trusted interim targets, and then derive an equivalent span-independent
backward view. It turns out the resulting update indeed changes the weights
precisely as in (12).
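The averaging step (12) itself is a simple interpolation between the previous trusted weights and the new online weights. The sketch below shows only that step, not its derivation; the function and argument names are illustrative assumptions.

```python
def unified_step(theta_trusted, theta_online_next, beta_next):
    """One averaging step of the form (12).

    beta_next = 1 adopts the new online weights, beta_next = 0 keeps the
    trusted weights unchanged, and values in between interpolate.
    """
    return [(1.0 - beta_next) * w + beta_next * v
            for w, v in zip(theta_trusted, theta_online_next)]


trusted = [1.0, 2.0]
online = [3.0, 4.0]
kept = unified_step(trusted, online, 0.0)      # no trust in the interim target
adopted = unified_step(trusted, online, 1.0)   # full trust
halfway = unified_step(trusted, online, 0.5)   # partial trust
```

The step costs $O(n)$ and needs no stored history beyond the two weight vectors.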

In the forward view, the online weights that always trust the latest interim target fully
will be denoted $\tilde\theta_t^h$ to differentiate them from the trusted interim weights $\theta_t^h$. If we have
data up to horizon $h$ but we trust the latest interim target only with degree $\beta_h \in [0, 1]$,
then the predictions prior to $h$ should update towards $Z^h$ only to this degree and for the
rest, with degree $(1 - \beta_h)$, fall back on earlier interim targets. Similarly we update towards
$Z^{h-1}$ only as far as this interim target was trusted, so with total degree $(1 - \beta_h)\beta_{h-1}$,
and then further fall back to $Z^{h-2}$ with total degree $(1 - \beta_h)(1 - \beta_{h-1})\beta_{h-2}$, and so on.

If we do not trust any of the interim targets observed between the time $t$ of making the
prediction and the current horizon $h$, the predictions should remain wherever they were at
time $t$. This can be achieved by updating the prediction at time $t$ towards the then-current
prediction $\phi_t^\top\theta_t^t$ with the remaining degree $(1 - \beta_h)(1 - \beta_{h-1}) \cdots (1 - \beta_{t+1})$. Note that
the different multipliers sum to one:

$$\beta_h + (1 - \beta_h)\beta_{h-1} + (1 - \beta_h)(1 - \beta_{h-1})\beta_{h-2} + \cdots + (1 - \beta_h) \cdots (1 - \beta_{t+2})\beta_{t+1} + (1 - \beta_h) \cdots (1 - \beta_{t+1}) = 1 .$$
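That the multipliers sum to one can be verified directly. In the sketch below, `betas[k]` stands for the degree of trust in the interim target at horizon k; the function name and the example trust values are illustrative assumptions.

```python
def trust_weights(betas, t, h):
    """Multipliers for the targets at horizons h, h-1, ..., t+1, followed by
    the multiplier for the prediction that was current at time t.

    betas[k] is the degree of trust in the interim target at horizon k.
    """
    weights, remaining = [], 1.0
    for k in range(h, t, -1):           # k = h, h-1, ..., t+1
        weights.append(remaining * betas[k])
        remaining *= 1.0 - betas[k]
    weights.append(remaining)           # untrusted remainder
    return weights


# Hypothetical trust values for horizons 1, 2 and 3 (index 0 is unused).
weights = trust_weights([0.0, 0.3, 0.9, 0.5], t=0, h=3)
total = sum(weights)
```

Each pass through the loop hands a fraction of the remaining mass to one target, so the total is one for any trust values in $[0, 1]$.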

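The total forward view for a prediction at time $t$ with a horizon of $h$ is therefore given by

$$
\begin{aligned}
\bar{\theta}^h_{t+1} ={}& \bar{\theta}^h_t + \beta_h\, \alpha_t \phi_t \big(Z^h - \phi_t^\top \bar{\theta}^h_t\big) \\
&+ (1-\beta_h)\beta_{h-1}\, \alpha_t \phi_t \big(Z^{h-1} - \phi_t^\top \bar{\theta}^h_t\big) \\
&+ (1-\beta_h)(1-\beta_{h-1})\beta_{h-2}\, \alpha_t \phi_t \big(Z^{h-2} - \phi_t^\top \bar{\theta}^h_t\big) \\
&+ \dots \\
&+ (1-\beta_h)\cdots(1-\beta_{t+2})\beta_{t+1}\, \alpha_t \phi_t \big(Z^{t+1} - \phi_t^\top \bar{\theta}^h_t\big) \\
&+ (1-\beta_h)\cdots(1-\beta_{t+2})(1-\beta_{t+1})\, \alpha_t \phi_t \big(\phi_t^\top \bar{\theta}^t_t - \phi_t^\top \bar{\theta}^h_t\big) \\
&\text{(now grouping the terms } \alpha_t \phi_t \phi_t^\top \bar{\theta}^h_t \text{, with total weight equal to one)} \\
={}& \bar{\theta}^h_t + \alpha_t \phi_t \Big( \beta_h Z^h + (1-\beta_h)\beta_{h-1} Z^{h-1} + (1-\beta_h)(1-\beta_{h-1})\beta_{h-2} Z^{h-2} + \dots \\
&\qquad + (1-\beta_h)\cdots(1-\beta_{t+2})\beta_{t+1} Z^{t+1} + (1-\beta_h)\cdots(1-\beta_{t+1})\, \phi_t^\top \bar{\theta}^t_t - \phi_t^\top \bar{\theta}^h_t \Big) \\
={}& \bar{\theta}^h_t + \alpha_t \phi_t \big(\bar{Z}^h_t - \phi_t^\top \bar{\theta}^h_t\big) \\
={}& F_t \bar{\theta}^h_t + \alpha_t \phi_t \bar{Z}^h_t , \qquad t = 0,\dots,h-1;\ h = 1,\dots,T,
\end{aligned}
\tag{13}
$$

where

$$
\begin{aligned}
\bar{Z}^h_t ={}& \beta_h Z^h \\
&+ (1-\beta_h)\beta_{h-1} Z^{h-1} \\
&+ (1-\beta_h)(1-\beta_{h-1})\beta_{h-2} Z^{h-2} \\
&+ \dots \\
&+ (1-\beta_h)\cdots(1-\beta_{t+2})\beta_{t+1} Z^{t+1} \\
&+ (1-\beta_h)\cdots(1-\beta_{t+2})(1-\beta_{t+1})\, \phi_t^\top \bar{\theta}^t_t \\
={}& \beta_h Z^h + (1-\beta_h)\bar{Z}^{h-1}_t , \qquad t = 0,\dots,h-1;\ h = 1,\dots,T, \text{ and}
\end{aligned}
\tag{14}
$$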

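$$\bar{Z}^t_t = \phi_t^\top \bar{\theta}^t_t , \qquad t = 0,\dots,T-1. \tag{15}$$

To derive a span-independent variant of algorithm (13), we first identify how a general
weight vector $\bar{\theta}^h_t$ depends on the combined interim targets and the observed feature vectors
by applying the recursive definition in (13) repeatedly, yielding

$$
\begin{aligned}
\bar{\theta}^h_t &= F_{t-1}\bar{\theta}^h_{t-1} + \alpha_{t-1}\phi_{t-1}\bar{Z}^h_{t-1} && \text{(applying (13) to } \bar{\theta}^h_t) \\
&= F_{t-1}\big(F_{t-2}\bar{\theta}^h_{t-2} + \alpha_{t-2}\phi_{t-2}\bar{Z}^h_{t-2}\big) + \alpha_{t-1}\phi_{t-1}\bar{Z}^h_{t-1} && \text{(applying (13) to } \bar{\theta}^h_{t-1}) \\
&= F_{t-1}F_{t-2}\bar{\theta}^h_{t-2} + F_{t-1}\alpha_{t-2}\phi_{t-2}\bar{Z}^h_{t-2} + \alpha_{t-1}\phi_{t-1}\bar{Z}^h_{t-1} \\
&= F_{t-1}F_{t-2}\big(F_{t-3}\bar{\theta}^h_{t-3} + \alpha_{t-3}\phi_{t-3}\bar{Z}^h_{t-3}\big) + F_{t-1}\alpha_{t-2}\phi_{t-2}\bar{Z}^h_{t-2} + \alpha_{t-1}\phi_{t-1}\bar{Z}^h_{t-1} && \text{(applying (13) to } \bar{\theta}^h_{t-2}) \\
&= F_{t-1}F_{t-2}F_{t-3}\bar{\theta}^h_{t-3} + F_{t-1}F_{t-2}\alpha_{t-3}\phi_{t-3}\bar{Z}^h_{t-3} + F_{t-1}\alpha_{t-2}\phi_{t-2}\bar{Z}^h_{t-2} + \alpha_{t-1}\phi_{t-1}\bar{Z}^h_{t-1} \\
&\;\;\vdots \\
&= \underbrace{F_{t-1}\cdots F_0\,\bar{\theta}^h_0}_{=\,a_{t-1}} + \sum_{k=0}^{t-1} F_{t-1}\cdots F_{k+1}\alpha_k\phi_k\bar{Z}^h_k \\
&= a_{t-1} + \sum_{k=0}^{t-1} F_{t-1}\cdots F_{k+1}\alpha_k\phi_k\bar{Z}^h_k .
\end{aligned}
\tag{16}
$$

The last step uses the fact that the initial trusted weights $\bar{\theta}_0$ are equal to the initial online
weights, such that $\bar{\theta}^t_0 = \theta^t_0 = \theta_0$ for any $t$, which means $a_t$ is the same as before. Notice
that we have not yet used the definition of $\bar{Z}^h_t$ in any way; the derivation so far holds for
any combined target.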

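We now first examine if we can efficiently go down in the triangle, that is, to get to
$\bar{\theta}^{h+1}_h$ from $\bar{\theta}^h_h$. We have

$$
\begin{aligned}
\bar{\theta}^{h+1}_h - \bar{\theta}^h_h
&= \Big(a_{h-1} + \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\bar{Z}^{h+1}_k\Big) - \Big(a_{h-1} + \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\bar{Z}^{h}_k\Big) && \text{(from (16))} \\
&= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\big(\bar{Z}^{h+1}_k - \bar{Z}^{h}_k\big) && \text{(merge sums, cancel } a_{h-1}) \\
&= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\big(\beta_{h+1}Z^{h+1} + (1-\beta_{h+1})\bar{Z}^{h}_k - \bar{Z}^{h}_k\big) && \text{(using (14) on } \bar{Z}^{h+1}_k) \\
&= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\,\beta_{h+1}\big(Z^{h+1} - \bar{Z}^{h}_k\big) \\
&= \beta_{h+1}\underbrace{\Big(\sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\Big)}_{=\,e_{h-1}} Z^{h+1} - \beta_{h+1}\underbrace{\sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\bar{Z}^{h}_k}_{=\,\bar{\theta}^h_h - a_{h-1},\ \text{from (16)}} \\
&= \beta_{h+1}e_{h-1}Z^{h+1} - \beta_{h+1}\big(\bar{\theta}^h_h - a_{h-1}\big) \\
&= \beta_{h+1}\underbrace{\big(a_{h-1} + e_{h-1}Z^{h+1}\big)}_{=\,\theta^{h+1}_h,\ \text{from the online forward view}} - \beta_{h+1}\bar{\theta}^h_h \\
&= \beta_{h+1}\big(\theta^{h+1}_h - \bar{\theta}^h_h\big) .
\end{aligned}
\tag{17}
$$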

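Thus $\bar{\theta}^{h+1}_h$ can be written as a simple combination of the previous trusted weights $\bar{\theta}^h_h$ and
the interim online weights $\theta^{h+1}_h$ with

$$\bar{\theta}^{h+1}_h = (1-\beta_{h+1})\,\bar{\theta}^h_h + \beta_{h+1}\,\theta^{h+1}_h . \tag{18}$$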

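We can plug this value into the definition of $\bar{\theta}^{h+1}_{h+1}$ to find

$$
\begin{aligned}
\bar{\theta}^{h+1}_{h+1} &= F_h\bar{\theta}^{h+1}_h + \alpha_h\bar{Z}^{h+1}_h\phi_h && \text{(using (13))} \\
&= F_h\big[(1-\beta_{h+1})\bar{\theta}^h_h + \beta_{h+1}\theta^{h+1}_h\big] + \alpha_h\bar{Z}^{h+1}_h\phi_h && \text{(using (18))} \\
&= F_h\big[(1-\beta_{h+1})\bar{\theta}^h_h + \beta_{h+1}\theta^{h+1}_h\big] + \alpha_h\big(\beta_{h+1}Z^{h+1} + (1-\beta_{h+1})\bar{Z}^h_h\big)\phi_h && \text{(using (14))} \\
&= F_h\big[(1-\beta_{h+1})\bar{\theta}^h_h + \beta_{h+1}\theta^{h+1}_h\big] + \alpha_h\big(\beta_{h+1}Z^{h+1} + (1-\beta_{h+1})\phi_h^\top\bar{\theta}^h_h\big)\phi_h . && \text{(using (15))}
\end{aligned}
$$

Now we group the terms depending on whether they are trusted (multiplied with $\beta_{h+1}$)
or untrusted (multiplied with $(1 - \beta_{h+1})$) to simplify further to

$$
\begin{aligned}
\bar{\theta}^{h+1}_{h+1} &= (1-\beta_{h+1})\underbrace{\big[F_h\bar{\theta}^h_h + \alpha_h\phi_h\phi_h^\top\bar{\theta}^h_h\big]}_{=\,\bar{\theta}^h_h,\ \text{using } F_h = I - \alpha_h\phi_h\phi_h^\top} + \beta_{h+1}\underbrace{\big(F_h\theta^{h+1}_h + \alpha_h\phi_hZ^{h+1}\big)}_{=\,\theta^{h+1}_{h+1}} \\
&= (1-\beta_{h+1})\,\bar{\theta}^h_h + \beta_{h+1}\,\theta^{h+1}_{h+1} .
\end{aligned}
$$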


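All superscripts now match their corresponding subscripts and so we can write down an
algorithm that is equivalent to the forward view in the sense that $\bar{\theta}^t_t = \bar{\theta}_t$ for all $t$, with

$$
\begin{aligned}
e_{-1} = 0, \text{ then } e_t &= e_{t-1} + \alpha_t\phi_t\big(1 - \phi_t^\top e_{t-1}\big), && t = 0,\dots,T-1, \\
\theta_{t+1} &= \theta_t + e_t\big(Z^{t+1} - Z^t\big) + \alpha_t\phi_t\big(Z^t - \phi_t^\top\theta_t\big), && t = 0,\dots,T-1, \\
\bar{\theta}_{t+1} &= \bar{\theta}_t + \beta_{t+1}\big(\theta_{t+1} - \bar{\theta}_t\big), && t = 0,\dots,T-1,
\end{aligned}
\tag{19}
$$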

where $\beta_t \in [0,1]$ is the degree of trust we place in $Z^t$. The first two lines compute the
online weights, and are equal to the online backward view (11) from the previous section.
The last line effectively computes a weighted running average of the online weights, according
to the sequence $\{\beta_t\}$.

Algorithm (19) subsumes the previous algorithms. If $\beta_t = 1$ for all $t$, then the predictions
are equal to those of the online algorithm on each step. If $\beta_t = 0$ for $t \in \{1,\dots,T-1\}$ and
$\beta_T = 1$, then the predictions are equal to those of the offline algorithm. As long as we trust
the final outcome, such that $\beta_T = 1$, all these algorithms result in exactly the same
weights at the end of the episode, and algorithm (19) allows us to be flexible about how
much we change the predictions during the episode, without requiring us to commit to
either fully online or fully offline updates for the whole episode.
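The equivalence between the forward view (13)-(15) and the backward view (19) can be checked numerically. The sketch below is an illustration in plain Python, not the authors' implementation; the function names, the list-based vectors, and the indexing convention (`Z[0..T]`, `betas[1..T]` with `betas[0]` unused) are assumptions made for this example.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def forward_view(theta0, phis, alphas, Z, betas):
    """Brute-force forward view (13)-(15); returns the trusted weights at time T."""
    T = len(phis)
    diag = [list(theta0)]                   # diag[t]: trusted weights on the diagonal
    for h in range(1, T + 1):
        w = list(theta0)                    # every horizon restarts from the initial weights
        for t in range(h):
            zbar = dot(phis[t], diag[t])    # eq. (15)
            for j in range(t + 1, h + 1):   # eq. (14), unrolled up to horizon h
                zbar = betas[j] * Z[j] + (1.0 - betas[j]) * zbar
            err = zbar - dot(phis[t], w)
            w = [wi + alphas[t] * err * p for wi, p in zip(w, phis[t])]
        diag.append(w)
    return diag[T]


def backward_view(theta0, phis, alphas, Z, betas):
    """Algorithm (19); returns (online weights, trusted weights) at time T."""
    e = [0.0] * len(theta0)
    theta, theta_bar = list(theta0), list(theta0)
    for t, (phi, a) in enumerate(zip(phis, alphas)):
        c = a * (1.0 - dot(phi, e))
        e = [ei + c * p for ei, p in zip(e, phi)]               # dutch trace
        td = Z[t + 1] - Z[t]                                    # TD error between interim targets
        corr = a * (Z[t] - dot(phi, theta))
        theta = [th + ei * td + corr * p for th, ei, p in zip(theta, e, phi)]
        b = betas[t + 1]
        theta_bar = [tb + b * (th - tb) for tb, th in zip(theta_bar, theta)]
    return theta, theta_bar


# Small made-up episode with T = 3 steps and two features.
phis = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
alphas = [0.5, 0.3, 0.2]
Z = [0.0, 1.0, 0.5, 2.0]
betas = [0.0, 0.2, 0.7, 1.0]

fwd = forward_view([0.0, 0.0], phis, alphas, Z, betas)
online, trusted = backward_view([0.0, 0.0], phis, alphas, Z, betas)
max_gap = max(abs(a - b) for a, b in zip(fwd, trusted))
```

The forward view recomputes every earlier update at each new horizon, so its cost grows with the span, whereas the backward view touches each time step once with O(n) work.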


6 Bootstrapping 


So far we have always fully trusted the actual final outcome. All interim targets have 
been deemed irrelevant by the end of the episode, leaving no effect on the computed final 
weights. There are cases in which we do not want to discard all interim targets. For 
instance, consider a stock that crashes down just before the end of the year. Certainly 
our updated predictions should include the possibility of such a crash, but we may not 
want to predict it will always crash just before the year’s end. Similarly, suppose it rains 
on a certain date for which we want to predict the weather. It then seems wasteful to 
ignore the sunny weather on the days leading up to that date. These are examples of 
cases in which the interim targets are almost as informative as the final outcome. In some 
cases, an interim target may even be more informative. For instance, it may be due to 
a highly-trusted expert that takes into account all possible outcomes from that point in 
time. Surely, this expert should not be ignored completely in favor of one random final 
outcome. 

The general idea of updating predictions using other predictions, such as the interim
targets, is called bootstrapping (see, e.g., Sutton and Barto 1998). One way to obtain
persistence of interim targets in our final predictions, and to achieve bootstrapping, is to
drop the requirement in the previous section that we trust the final outcome fully, and
allow $\beta_T < 1$. The eventual target for our updates is then a weighted average of the
interim targets and the final outcome. For instance, if $\beta_t = 1/2$ for all $t \in \{1,\dots,T\}$
the final updates will place a weight of $\beta_T = 1/2$ on the final outcome, a weight of
$(1 - \beta_T)\beta_{T-1} = 1/4$ on the interim target immediately before then, and so on.

The notion of trust from the previous section applies a single degree of trust for an
interim target uniformly to all prior predictions, but this is not always desirable. Using
this definition of trust, if we fully trust an interim target then it replaces all earlier interim
targets. However, even if an interim target is fully trusted for the most recent prediction,
it may be inherently less trustworthy for earlier predictions. To illustrate this, consider
flipping a coin three times and predicting the total number of heads. The possible final
outcomes are 0, 1, 2, and 3 heads. If a trusted expert tells us before the first flip that
the coin is fair, that is equivalent to observing a trustworthy interim target of 1.5.
Suppose then the first two flips both result in heads, such that the only remaining possible
final outcomes are 2 and 3. If the coin is indeed fair, an interim target of 2.5 would now be
a trustworthy target for the prediction made after observing two heads. However, we would
probably not want to replace the earlier interim target of 1.5 for the first prediction.
Unfortunately, this is exactly what happens with the algorithm from the previous section.

This suggests a different notion of trust, based on the degree of trust we place in an 
interim target as a stand-in for the expected final outcome rather than the actual (random) 
outcome. If an interim target precisely matches the expected outcome at that point in 
time, then later targets can only be noisier or more specific but not more informative,
and it should never be replaced by later targets. 

Suppose, concretely, that we fully trust $Z^t$ under this new notion of trust. Then, the
update for the prediction made at time $t - 1$ should disregard any targets that arrive later,
including even the final outcome. Conversely, if we do not trust $Z^t$ at all, then it should
leave no trace in the final updates. More generally we can update towards $Z^t$ with an
intermediate degree of trust $\eta_t \in [0,1]$. If the next interim target $Z^{t+1}$ is trusted with
degree $\eta_{t+1}$, we then update the prediction made at time $t-1$ towards it with a total weight of
$(1 - \eta_t)\eta_{t+1}$. The update towards $Z^{t+2}$ will get a total weight of $(1 - \eta_t)(1 - \eta_{t+1})\eta_{t+2}$,
and so on until we reach either the final outcome or the current data horizon. The latest
interim target (and the final outcome) is always trusted fully until we move to the next
time horizon. Therefore, at horizon $h$ we always place any remaining weight on $Z^h$ and
update towards it with total weight $(1 - \eta_t)(1 - \eta_{t+1}) \cdots (1 - \eta_{h-1})$.

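The corresponding total update to the prediction at time $t$ with a current horizon $h$ is
then

$$
\begin{aligned}
\theta^h_{t+1} ={}& \theta^h_t + \eta_{t+1}\,\alpha_t\phi_t\big(Z^{t+1} - \phi_t^\top\theta^h_t\big) \\
&+ (1-\eta_{t+1})\eta_{t+2}\,\alpha_t\phi_t\big(Z^{t+2} - \phi_t^\top\theta^h_t\big) \\
&+ (1-\eta_{t+1})(1-\eta_{t+2})\eta_{t+3}\,\alpha_t\phi_t\big(Z^{t+3} - \phi_t^\top\theta^h_t\big) \\
&+ \dots \\
&+ (1-\eta_{t+1})\cdots(1-\eta_{h-2})\eta_{h-1}\,\alpha_t\phi_t\big(Z^{h-1} - \phi_t^\top\theta^h_t\big) \\
&+ (1-\eta_{t+1})\cdots(1-\eta_{h-1})\,\alpha_t\phi_t\big(Z^{h} - \phi_t^\top\theta^h_t\big) \\
={}& \theta^h_t + \alpha_t\phi_t\big(Z^h_t - \phi_t^\top\theta^h_t\big) ,
\end{aligned}
\tag{20}
$$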

where we have grouped the updates into a single update towards a combined target $Z^h_t$,
just as in the previous section. This update is perhaps more familiar when we change the
notation slightly. For all $t$, we define $\lambda_t = 1 - \eta_t$ such that $\lambda_t$ essentially specifies to what
degree we distrust $Z^t$. The combined target is then defined as

$$
\begin{aligned}
Z^h_t ={}& (1-\lambda_{t+1})Z^{t+1} \\
&+ \lambda_{t+1}(1-\lambda_{t+2})Z^{t+2} \\
&+ \dots \\
&+ \lambda_{t+1}\cdots\lambda_{h-2}(1-\lambda_{h-1})Z^{h-1} \\
&+ \lambda_{t+1}\cdots\lambda_{h-1}Z^{h} .
\end{aligned}
\tag{21}
$$

This target $Z^h_t$ is known as a $\lambda$-return (Sutton and Barto 1998). The version that truncates
at the current horizon $h$ was first proposed by van Seijen and Sutton (2014). The total set
of updates is

$$
\begin{aligned}
\theta^t_0 &= \theta_0 , && t = 0,\dots,T; \\
\theta^h_{t+1} &= F_t\theta^h_t + \alpha_t\phi_tZ^h_t , && t = 0,\dots,h-1;\ h = 1,\dots,T.
\end{aligned}
\tag{22}
$$


If we ultimately distrust all interim targets, then $\lambda_t = 1$ for all $t$ and $Z^h_t = Z^h$. The
algorithm then reduces to the online algorithm from Section 4. Otherwise, at least some
interim targets persist and contribute to the final weights. At the other extreme, if we
trust all interim targets, then $\lambda_t = 0$ for all $t$ and $Z^h_t = Z^{t+1}$. Then, the updates reduce
to single-step updates that only use the immediate next interim target. In that case each
update depends only on the immediate next time step and we can drop the superscript $h$
and the forward view (22) reduces to an efficient span-independent algorithm

$$\theta_{t+1} = \theta_t + \alpha_t\phi_t\big(Z^{t+1} - \phi_t^\top\theta_t\big), \qquad t = 0,\dots,T.$$
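The truncated $\lambda$-return (21) and its two extreme cases can be illustrated with a few lines of Python. This is a sketch under assumed conventions: the function name `lambda_return` is hypothetical, `Z[j]` holds the interim target $Z^j$, and `lambdas[j]` holds $\lambda_j$.

```python
def lambda_return(Z, lambdas, t, h):
    """Truncated lambda-return Z_t^h of eq. (21), for horizon h > t."""
    total = 0.0
    carry = 1.0                       # lambda_{t+1} * ... * lambda_{j-1}
    for j in range(t + 1, h):
        total += carry * (1.0 - lambdas[j]) * Z[j]
        carry *= lambdas[j]
    return total + carry * Z[h]       # remaining weight goes to the latest target


Z = [0.0, 1.0, 2.0, 4.0]
print(lambda_return(Z, [0.5] * 4, 0, 3))   # 0.5*1 + 0.25*2 + 0.25*4 = 2.0
print(lambda_return(Z, [1.0] * 4, 0, 3))   # 4.0, i.e. Z^h (distrust all interim targets)
print(lambda_return(Z, [0.0] * 4, 0, 3))   # 1.0, i.e. Z^{t+1} (trust the next target fully)
```

The running product `carry` is what makes the weights sum to one: whatever has not been assigned to an earlier interim target is passed on to the next one.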


Apart from this special case, the forward view (22) is computationally inefficient and we
desire an efficient span-independent algorithm to get from $\theta^h_h$ to $\theta^{h+1}_{h+1}$. In the previous
section we derived how a weight vector depends on any sequence of combined targets,
independent of the definition of those targets. We repeat the result of that derivation, as
first given in (16), here for clarity:

$$\theta^h_t = a_{t-1} + \sum_{k=0}^{t-1} F_{t-1}\cdots F_{k+1}\alpha_k\phi_kZ^h_k , \qquad t = 0,\dots,h-1;\ h = 1,\dots,T.$$

Because this equation holds regardless of the definition of $Z^h_k$, we can apply it to the
current algorithm. In particular we use it to try to find an efficient algorithm to go from
$\theta^h_h$ to $\theta^{h+1}_h$. If this is possible, we can then use the update (22) to go from $\theta^{h+1}_h$ to $\theta^{h+1}_{h+1}$.

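We start by writing out the difference as

$$
\begin{aligned}
\theta^{h+1}_h - \theta^h_h &= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_kZ^{h+1}_k - \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_kZ^{h}_k && (a_{h-1} \text{ cancels}) \\
&= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\big(Z^{h+1}_k - Z^h_k\big) .
\end{aligned}
\tag{23}
$$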


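The combined targets $Z^{h+1}_k$ and $Z^h_k$ share many terms: going back to (21) we can see that
all interim targets up to $Z^{h-1}$ will have the exact same multipliers. These terms cancel,
and the remaining difference is given by

$$
\begin{aligned}
Z^{h+1}_k - Z^h_k &= \underbrace{\lambda_{k+1}\cdots\lambda_{h-1}(1-\lambda_h)Z^h + \lambda_{k+1}\cdots\lambda_hZ^{h+1}}_{\text{due to } Z^{h+1}_k} - \underbrace{\lambda_{k+1}\cdots\lambda_{h-1}Z^h}_{\text{due to } Z^h_k} \\
&= \lambda_{k+1}\cdots\lambda_h\big(Z^{h+1} - Z^h\big) .
\end{aligned}
\tag{24}
$$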

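We can then continue from (23) with

$$
\begin{aligned}
\theta^{h+1}_h - \theta^h_h &= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\big(Z^{h+1}_k - Z^h_k\big) \\
&= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\,\lambda_{k+1}\cdots\lambda_h\big(Z^{h+1} - Z^h\big) && \text{(using (24))} \\
&= \lambda_h\underbrace{\Big(\sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\lambda_{k+1}\cdots\lambda_{h-1}\alpha_k\phi_k\Big)}_{=\,e_{h-1}}\big(Z^{h+1} - Z^h\big) \\
&= \lambda_he_{h-1}\big(Z^{h+1} - Z^h\big) .
\end{aligned}
\tag{25}
$$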


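We again encounter the TD error $Z^{h+1} - Z^h$ and, more importantly, a new trace vector
$e_t$ that can be updated efficiently with

$$
\begin{aligned}
e_t &= \sum_{k=0}^{t} F_t\cdots F_{k+1}\lambda_{k+1}\cdots\lambda_t\alpha_k\phi_k \\
&= \sum_{k=0}^{t-1} F_t\cdots F_{k+1}\lambda_{k+1}\cdots\lambda_t\alpha_k\phi_k + \alpha_t\phi_t \\
&= \lambda_tF_t\underbrace{\sum_{k=0}^{t-1} F_{t-1}\cdots F_{k+1}\lambda_{k+1}\cdots\lambda_{t-1}\alpha_k\phi_k}_{=\,e_{t-1}} + \alpha_t\phi_t \\
&= \lambda_tF_te_{t-1} + \alpha_t\phi_t \qquad\qquad (26) \\
&= \lambda_te_{t-1} + \alpha_t\phi_t\big(1 - \lambda_t\phi_t^\top e_{t-1}\big) .
\end{aligned}
$$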


This trace is similar to the one we encountered before, but with the difference that the
value of the vector decays towards zero by multiplication with $\lambda_t$ on each step. Predictions
made prior to a fully trusted interim target (for which $\lambda_t = 0$) will never be affected by
later interim targets because the trace vector is set to zero. The extent to which the trace
extends backward in time depends on the extent to which we have not yet trusted the
corresponding interim targets.
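To make the trace recursion concrete, the sketch below compares the explicit sum in the first line of (26) with the O(n) recursive update. For brevity it assumes scalar features, so that $F_j = 1 - \alpha_j\phi_j^2$ is a number; the function names are hypothetical and chosen for this example.

```python
def trace_explicit(alphas, phis, lambdas, t):
    """e_t as the explicit sum over k = 0..t (scalar features assumed)."""
    total = 0.0
    for k in range(t + 1):
        term = alphas[k] * phis[k]
        for j in range(k + 1, t + 1):
            # multiply by lambda_j * F_j, with F_j = 1 - alpha_j * phi_j^2
            term *= lambdas[j] * (1.0 - alphas[j] * phis[j] ** 2)
        total += term
    return total


def trace_recursive(alphas, phis, lambdas, t):
    """e_t from the last line of (26), starting at e_{-1} = 0."""
    e = 0.0
    for j in range(t + 1):
        e = lambdas[j] * e + alphas[j] * phis[j] * (1.0 - lambdas[j] * phis[j] * e)
    return e


alphas = [0.5, 0.3, 0.2, 0.4]
phis = [1.0, -0.5, 2.0, 0.7]
lambdas = [1.0, 0.9, 0.0, 0.6]     # lambda_2 = 0: the trace is reset before step 2's own term
gaps = [abs(trace_explicit(alphas, phis, lambdas, t) -
            trace_recursive(alphas, phis, lambdas, t)) for t in range(4)]
```

Because $\lambda_2 = 0$ in this made-up sequence, the trace at $t = 2$ contains only the newest term $\alpha_2\phi_2$, which is the behaviour described in the paragraph above.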

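We now combine the derived update from $\theta^h_h$ to $\theta^{h+1}_h$ with the update from $\theta^{h+1}_h$ to
$\theta^{h+1}_{h+1}$ to derive a single efficient update, given by

$$
\begin{aligned}
\theta^{h+1}_{h+1} &= F_h\theta^{h+1}_h + \alpha_h\phi_hZ^{h+1} && \text{(using (22))} \\
&= F_h\big(\theta^h_h + \lambda_he_{h-1}(Z^{h+1} - Z^h)\big) + \alpha_h\phi_hZ^{h+1} && \text{(using (25))} \\
&= F_h\theta^h_h + \lambda_hF_he_{h-1}\big(Z^{h+1} - Z^h\big) + \alpha_h\phi_hZ^{h+1} \\
&= F_h\theta^h_h + (e_h - \alpha_h\phi_h)\big(Z^{h+1} - Z^h\big) + \alpha_h\phi_hZ^{h+1} && \text{(using } \lambda_hF_he_{h-1} = e_h - \alpha_h\phi_h \text{, from (26))} \\
&= F_h\theta^h_h + e_h\big(Z^{h+1} - Z^h\big) + \alpha_h\phi_hZ^h \\
&= (I - \alpha_h\phi_h\phi_h^\top)\theta^h_h + e_h\big(Z^{h+1} - Z^h\big) + \alpha_h\phi_hZ^h \\
&= \theta^h_h + e_h\big(Z^{h+1} - Z^h\big) + \alpha_h\phi_h\big(Z^h - \phi_h^\top\theta^h_h\big) .
\end{aligned}
$$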


This concludes our derivation because the value of $\theta^{h+1}_{h+1}$ is now defined fully in terms
of the previous weights on the diagonal $\theta^h_h$ and other quantities that are either directly
available upon reaching our new data horizon $h+1$ or, in the case of the trace vector, can
be computed with constant $O(n)$ computation per step. The span-independent algorithm
with persistent interim targets is

$$
\begin{aligned}
e_{-1} = 0, \text{ then } e_t &= \lambda_te_{t-1} + \alpha_t\phi_t\big(1 - \lambda_t\phi_t^\top e_{t-1}\big), && t = 0,\dots,T-1, \\
\theta_{t+1} &= \theta_t + e_t\big(Z^{t+1} - Z^t\big) + \alpha_t\phi_t\big(Z^t - \phi_t^\top\theta_t\big), && t = 0,\dots,T-1.
\end{aligned}
\tag{27}
$$


Compared to the online backward view (11), the only difference is the appearance of
$\lambda_t$ in the update of the trace. If $\lambda_t = 1$ for all $t$, we regain the online backward view
precisely, demonstrating that the new algorithm is strictly more general. In contrast to
the averaging backward view (19) from the previous section, we see that for the notion of
trust corresponding to $\lambda$ we do not need to maintain separate online and trusted weights.
Instead, the degree of trust is used to scale the trace vector down accordingly. If $\lambda_t < 1$
for any $t \in \{1,\dots,T\}$, the corresponding interim targets have a lasting effect on the final
weight vector and therefore for the first time we may obtain predictions that differ not
just during the episode but also at its end.
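The following sketch checks numerically that the backward view (27) reproduces the diagonal weights $\theta^h_h$ of the forward view (22) at every horizon. It is an illustration under assumed conventions (plain Python lists as vectors, `Z[0..T]`, `lam[j]` for $\lambda_j$), with hypothetical function names, and is not taken from the paper.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lambda_return(Z, lam, t, h):
    """Truncated lambda-return Z_t^h of eq. (21)."""
    total, carry = 0.0, 1.0
    for j in range(t + 1, h):
        total += carry * (1.0 - lam[j]) * Z[j]
        carry *= lam[j]
    return total + carry * Z[h]


def forward_weights(theta0, phis, alphas, Z, lam, h):
    """theta_h^h from the forward view (22): replay steps 0..h-1 with targets Z_t^h."""
    w = list(theta0)
    for t in range(h):
        err = lambda_return(Z, lam, t, h) - dot(phis[t], w)
        w = [wi + alphas[t] * err * p for wi, p in zip(w, phis[t])]
    return w


def backward_weights(theta0, phis, alphas, Z, lam):
    """Algorithm (27); returns [theta_0, theta_1, ..., theta_T]."""
    e = [0.0] * len(theta0)
    theta = list(theta0)
    out = [list(theta)]
    for t, (phi, a) in enumerate(zip(phis, alphas)):
        c = a * (1.0 - lam[t] * dot(phi, e))
        e = [lam[t] * ei + c * p for ei, p in zip(e, phi)]      # decaying dutch trace
        td = Z[t + 1] - Z[t]
        corr = a * (Z[t] - dot(phi, theta))
        theta = [th + ei * td + corr * p for th, ei, p in zip(theta, e, phi)]
        out.append(theta)
    return out


phis = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
alphas = [0.5, 0.3, 0.2]
Z = [0.3, 1.0, 0.5, 2.0]
lam = [1.0, 0.6, 0.0, 0.9]

bw = backward_weights([0.0, 0.0], phis, alphas, Z, lam)
gaps = [max(abs(a - b) for a, b in
            zip(forward_weights([0.0, 0.0], phis, alphas, Z, lam, h), bw[h]))
        for h in range(1, 4)]
```

The forward computation for horizon $h$ costs $O(h)$ updates, each needing a $\lambda$-return, while the backward view performs a single update per step.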


7 Combining two notions of trust and the emergence
of averaged TD($\lambda$)

Algorithm (27) is a strict generalization of the online algorithm (11), but it does not
subsume the offline algorithm or the averaging algorithm (19) that switches smoothly
between online and offline updates. In this section, we combine the ideas from the last two
sections to arrive at an algorithm that generalizes and subsumes all previous algorithms,
thereby unifying all that came before into a single, general-purpose algorithm.

An offline version of the TD($\lambda$) algorithm can be obtained by using the online algorithm
in (27) to update an online weight vector $\theta^t_t$ and then defining the trusted weight vector
$\bar{\theta}^t_t$ to remain equal to the initial weights until the last step, at which time we replace them
with the online weights. An algorithm that switches smoothly between the offline and
online cases can then be obtained similar to before, resulting in

$$
\begin{aligned}
e_{-1} = 0, \text{ then } e_t &= \lambda_te_{t-1} + \alpha_t\phi_t\big(1 - \lambda_t\phi_t^\top e_{t-1}\big), && t = 0,\dots,T-1, \\
\theta_{t+1} &= \theta_t + e_t\big(Z^{t+1} - Z^t\big) + \alpha_t\phi_t\big(Z^t - \phi_t^\top\theta_t\big), && t = 0,\dots,T-1, \\
\bar{\theta}_{t+1} &= \bar{\theta}_t + \beta_{t+1}\big(\theta_{t+1} - \bar{\theta}_t\big), && t = 0,\dots,T-1.
\end{aligned}
\tag{28}
$$


The first two lines are the online algorithm (27) from the previous section. The last line
is equal to the last line in the unified algorithm without persistency of interim targets,
as given in (19), but now using the online weights that use persistent interim targets
weighted according to $\lambda$-returns, as computed in the first two lines. When $\lambda_t = 1$ for all
$t$ we regain the averaging algorithm (19) without persistent interim targets. When $\beta_t = 1$
for all $t$ we regain the online algorithm (27) with persistent interim targets. So, we have
again successfully unified all previously seemingly different approaches to trust and have
arrived at a single general algorithm that subsumes all that came before.


The merits of $\lambda$-returns are well known (Sutton 1988; Sutton and Barto 1998) but the
$\beta$-weighting of the online weights is novel to this paper, and it is appropriate to discuss
it in a little more detail. So far we have considered only a single episode, but a major
potential benefit of including $\beta$ appears when we consider multiple episodes. For clarity,
consider the extreme case where all episodes last only a single step. Then $\lambda$ plays no role
because there is no interim within each episode; there is only a beginning and an end. If
we use $m$ to enumerate episodes, such that $Z_m$ is the true outcome of episode $m$, then the
updates for single-step episodes are

$$
\begin{aligned}
\theta_{m+1} &= \theta_m + \alpha_m\phi_m\big(Z_m - \phi_m^\top\theta_m\big) , \\
\bar{\theta}_{m+1} &= \bar{\theta}_m + \beta_{m+1}\big(\theta_{m+1} - \bar{\theta}_m\big) ,
\end{aligned}
$$

where we used the fact that the final weights of episode $m$ are the first weights of episode
$m + 1$. Using $\beta$, we can do something here that we cannot do with $\lambda$ alone: we can
weight the relative impact of different episodes. For instance, we can choose to keep
the weights and the predictions stationary over multiple episodes, by setting $\beta_m = 0$, to
reduce the impact of the final outcomes observed in those episodes on our predictions.
Another possibility is to decay the trust, for instance according to $\beta_m = 1/m$ such that
we trust the outcome of the first episode fully ($\beta_1 = 1$) and then reduce the trust for
each subsequent episode. Such a choice of $\beta$ makes sense if we view the trust we place in
outcomes as being relative to the trust we place in the predictions we already have. As
our predictions improve over time, the outcomes become relatively less trustworthy. With
this definition of trust, the trusted weights are the average of the weights of all previous
episodes: $\bar{\theta}_m = \frac{1}{m}\sum_{i=1}^{m}\theta_i$. This specific algorithm is interesting because the predictions
according to the averages $\bar{\theta}_m$ are known to converge to the optimal predictions faster than
the predictions according to any sequence of online weights $\theta_m$ (Polyak and Juditsky 1992;
Bach and Moulines 2013). This shows that the notion of trust as provided by $\beta$ gives us
something that cannot be obtained with $\lambda$ alone. To our knowledge, algorithm (28) is
the first to generalize this idea of averaging online weights, in a principled fashion, to
long-term predictions.
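The claim that a decaying trust of $\beta_m = 1/m$ turns the trusted weights into a plain running average can be illustrated with single-step episodes. The sketch assumes a scalar weight and a constant feature $\phi = 1$ (so the prediction is the weight itself); the function name, the constant step size, and the data are made up for this example.

```python
def averaged_single_step(theta0, alpha, outcomes):
    """Single-step episodes with trust beta_m = 1/m.

    Assumes a scalar weight and phi = 1. Returns the final online weight,
    the final trusted weight, and the list of online weights produced so far.
    """
    theta, theta_bar = theta0, theta0
    history = []
    for m, z in enumerate(outcomes, start=1):
        theta = theta + alpha * (z - theta)                 # online update for one episode
        theta_bar = theta_bar + (theta - theta_bar) / m     # beta_m = 1/m
        history.append(theta)
    return theta, theta_bar, history


theta, theta_bar, history = averaged_single_step(0.0, 0.5, [1.0, 3.0, 2.0, 6.0])
mean_of_online = sum(history) / len(history)
```

Here `theta_bar` and `mean_of_online` agree up to rounding, while `theta` itself keeps chasing the most recent outcome.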


8 Generalizing to cumulative returns and soft terminations

In this section, we discuss how to extend our algorithms to handle soft terminations and
cumulative returns. Both extensions generalize the episodic final-outcome setting considered
above, and the algorithm we derive in this section will subsume all previous algorithms.

Often, we want to predict the cumulation of a signal $\{X_t\}$ rather than a single final
outcome. In the episodic setting, with termination at time $T$, we then aim to predict

$$Z^T_t = X_{t+1} + X_{t+2} + \dots + X_T ,$$

where, as before, t is the time of the prediction. In contrast to final outcomes, these 
cumulative outcomes depend on the time step of the prediction because at later time 
steps there will be less signal left to accumulate before the episode ends. We call this 
time-dependent outcome the cumulative return. 

We may wish to update our predictions online, before observing the full cumulative
return. To do so, we need to define interim targets to temporarily take the place of the
actual return. An interim target up to a horizon $h < T$ should in any case include
the part of the signal that was already observed. In addition, we introduce a residual
prediction $P_h$ that stands in for the unseen part of the signal, from horizon $h$ to the end
of the episode $T$. The full interim target at time $t$ up to horizon $h$ is then

$$Z^h_t = X_{t+1} + X_{t+2} + \dots + X_h + P_h .$$

The residual prediction $P_h$ may for instance be given by an external expert or by our own
predictions at time $h$. For now we are agnostic to its origin and consider the general case.
In any case, we define $P_T = 0$ because at termination there is no remaining signal left to
predict. If all signals except the one coinciding with termination are zero, we regain the
final-outcome setting where the last signal $X_T = Z$ takes the role of the final outcome. If
we further define the interim target at each horizon to be equal to the residual prediction
at that horizon, such that $Z^h = P_h$, we are back in the online final-outcome setting.
Therefore, cumulative returns are strictly more general than final outcomes.
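A small sketch of the undiscounted interim target may help fix the indexing. It assumes lists `X[0..T]` and `P[0..T]` where `X[0]` is an unused placeholder (signals start at $X_1$); the function name `interim_target` and the numbers are illustrative only.

```python
def interim_target(X, P, t, h):
    """Z_t^h = X_{t+1} + ... + X_h + P_h for the undiscounted episodic setting."""
    return sum(X[t + 1:h + 1]) + P[h]


# Cumulative signal over T = 3 steps; P[T] = 0 because nothing is left to predict.
X = [0.0, 1.0, 2.0, 3.0]
P = [5.0, 4.0, 2.5, 0.0]
print(interim_target(X, P, 0, 2))   # 1 + 2 + 2.5 = 5.5
print(interim_target(X, P, 0, 3))   # 1 + 2 + 3 + 0 = 6.0, the full return

# Final-outcome special case: only the terminal signal is non-zero.
X_final = [0.0, 0.0, 0.0, 7.0]
print(interim_target(X_final, P, 0, 2))   # 2.5, i.e. just the residual prediction P_2
print(interim_target(X_final, P, 0, 3))   # 7.0, the final outcome
```

In the special case, the interim target no longer depends on the prediction time $t$, which is why the earlier sections could write it simply as $Z^h$.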

So far, we have considered episodic predictions where each episode ends with a single final
outcome that we wish to predict. The algorithms extend naturally to multiple episodes
by using the final weights of the episode as the initial weights of the next episode. However,
some predictive questions do not fit nicely into this strictly episodic format because
they are better thought of as terminating softly on each time step.

A soft termination is a conceptual and potentially partial termination of the signal. 
Such a soft termination could represent a probability of termination, for instance when we 
want to take into account the probability of a robot breaking while learning in a simulation 
in which it never actually does. Or the soft termination could represent a desire to trade off 
the imminence and the magnitude of a signal, for instance when we do not just want more 
money rather than less, but we also want it sooner rather than later. In both cases the 
prediction is about a diminishing version of the ‘raw’ signal (e.g., money). Here, we are not 
concerned with the potential reasons for using soft terminations and consider the general 
case where the termination of the prediction can vary per time step and be anywhere 
between full continuation and full termination. Soft terminations allow us to ask more 
general predictive questions, and even to simultaneously consider multiple predictions that 
may resolve at different times. 

Soft terminations can be modeled by using a continuation parameter $\gamma_t \in [0,1]$ to
denote the amount of termination of our prediction upon reaching time $t$. This quantity is
often called a discount factor, because it discounts the impact of later outcomes compared
to earlier ones. If $\gamma_t = 1$, no termination happens at time $t$; if $\gamma_t = 0$, the prediction
terminates fully, even if the trajectory may continue. We consider general sequences
of $\gamma_t$ and only require that eventually every prediction resolves completely, potentially
asymptotically, such that $\prod_{k=t+1}^{\infty}\gamma_k = 0$ for all $t$. The episodic setting considered in the
previous sections is a special case where $\gamma_t = 0$ only if $t = T$ is the final step of the
episode, and $\gamma_t = 1$ on all other steps.

We are now almost ready to formulate the target for a prediction about a discounted
cumulative signal. In order to maintain full generality and compatibility with previous
sections, we immediately include persistency of the residual predictions according to a
sequence $\{\lambda_t\}$ and trust of the resulting combined targets according to a sequence
$\{\beta_t\}$. For any horizon $h > t$, the combined interim target for the prediction at time
$t$ should include at least the immediate next signal $X_{t+1}$. Beyond this first signal we
continue with $\gamma_{t+1}$ and observe the residual prediction $P_{t+1}$. We trust this prediction with
degree $1 - \lambda_{t+1}$ as a stand-in for the expected cumulative discounted return, and so its
total multiplier is $\gamma_{t+1}(1 - \lambda_{t+1})$. If we trust this prediction fully, so that $\lambda_{t+1} = 0$, there
is no need to continue further. Otherwise, we continue for as much as we have not yet
terminated according to both $\gamma_{t+1}$ and $\lambda_{t+1}$, so that the next signal $X_{t+2}$ gets a total
weight of $\gamma_{t+1}\lambda_{t+1}$. This line of reasoning then continues until we reach the horizon of our
current data at time $h$, at which point we place any remaining trust on the most recent
residual prediction $P_h$. All together, this gives us a combined target for the prediction at
time $t$ with data up to time $h > t$:

$$
\begin{aligned}
Z^h_t ={}& X_{t+1} + \gamma_{t+1}(1-\lambda_{t+1})P_{t+1} \\
&+ \gamma_{t+1}\lambda_{t+1}\big(X_{t+2} + \gamma_{t+2}(1-\lambda_{t+2})P_{t+2}\big) \\
&+ \gamma_{t+1}\lambda_{t+1}\gamma_{t+2}\lambda_{t+2}\big(X_{t+3} + \gamma_{t+3}(1-\lambda_{t+3})P_{t+3}\big) \\
&+ \dots \\
&+ \gamma_{t+1}\cdots\gamma_{h-2}\,\lambda_{t+1}\cdots\lambda_{h-2}\big(X_{h-1} + \gamma_{h-1}(1-\lambda_{h-1})P_{h-1}\big) \\
&+ \gamma_{t+1}\cdots\gamma_{h-1}\,\lambda_{t+1}\cdots\lambda_{h-1}\big(X_{h} + \gamma_{h}P_{h}\big) .
\end{aligned}
\tag{29}
$$

This can be written recursively as

$$Z^h_t = X_{t+1} + \gamma_{t+1}\big((1-\lambda_{t+1})P_{t+1} + \lambda_{t+1}Z^h_{t+1}\big) , \tag{30}$$

$$\text{and}\quad Z^t_t = P_t . \tag{31}$$
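The agreement between the explicit sum (29) and the recursion (30)-(31) can be checked directly. The sketch below assumes lists indexed by time step (`X[j]`, `P[j]`, `gam[j]`, `lam[j]` for $X_j$, $P_j$, $\gamma_j$, $\lambda_j$, with index 0 of `X`, `gam` and `lam` unused); the function names and numbers are hypothetical.

```python
def target_recursive(X, P, gam, lam, t, h):
    """Z_t^h via the recursion (30), bottoming out at Z_t^t = P_t (31)."""
    if t == h:
        return P[t]
    nxt = target_recursive(X, P, gam, lam, t + 1, h)
    return X[t + 1] + gam[t + 1] * ((1.0 - lam[t + 1]) * P[t + 1] + lam[t + 1] * nxt)


def target_explicit(X, P, gam, lam, t, h):
    """Z_t^h via the explicit sum (29), for horizon h > t."""
    total = 0.0
    carry = 1.0        # gamma_{t+1} lambda_{t+1} ... gamma_{j-1} lambda_{j-1}
    for j in range(t + 1, h):
        total += carry * (X[j] + gam[j] * (1.0 - lam[j]) * P[j])
        carry *= gam[j] * lam[j]
    return total + carry * (X[h] + gam[h] * P[h])


X = [0.0, 1.0, -0.5, 2.0, 0.3]
P = [0.7, 1.5, 0.2, -1.0, 0.0]
gam = [1.0, 0.9, 0.8, 1.0, 0.0]
lam = [1.0, 0.5, 0.0, 0.7, 1.0]
gaps = [abs(target_recursive(X, P, gam, lam, t, h) - target_explicit(X, P, gam, lam, t, h))
        for h in range(1, 5) for t in range(h)]
```

For $t = h - 1$ both forms reduce to the one-step target $X_h + \gamma_hP_h$, since all remaining trust is placed on the latest residual prediction.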

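Each of these interim targets is trusted according to the trust associated with its horizon,
$\beta_h$. Recall from Section 5 that this trust propagates backward. If we trust the last target
$Z^h_t$ fully, there is no need to consider earlier targets. Otherwise, we multiply $Z^h_t$ with
its associated trust $\beta_h$ and continue to the previous target $Z^{h-1}_t$ with degree $(1 - \beta_h)$. This
then continues until either we find a target we fully trust, or we reach the then-current
predictions $\phi_t^\top\bar{\theta}^t_t$ at time $t$, when we made the prediction we are currently updating. Our
final interim targets are then given by

$$
\begin{aligned}
\bar{Z}^h_t ={}& \beta_hZ^h_t \\
&+ (1-\beta_h)\beta_{h-1}Z^{h-1}_t \\
&+ (1-\beta_h)(1-\beta_{h-1})\beta_{h-2}Z^{h-2}_t \\
&+ \dots \\
&+ (1-\beta_h)\cdots(1-\beta_{t+2})\beta_{t+1}Z^{t+1}_t \\
&+ (1-\beta_h)\cdots(1-\beta_{t+1})\,\phi_t^\top\bar{\theta}^t_t \\
={}& \beta_hZ^h_t + (1-\beta_h)\bar{Z}^{h-1}_t
\end{aligned}
\tag{32}
$$

$$\text{and}\quad \bar{Z}^t_t = \phi_t^\top\bar{\theta}^t_t , \tag{33}$$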

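where $\bar{\theta}^t_t$ are the trusted weights for time $t$ that we will compute. The forward-viewing
update is then given by

$$
\begin{aligned}
\bar{\theta}^t_0 &= \theta_0 , && t = 0,\dots,T, \\
\bar{\theta}^h_{t+1} &= \bar{\theta}^h_t + \alpha_t\phi_t\big(\bar{Z}^h_t - \phi_t^\top\bar{\theta}^h_t\big) , && t = 0,\dots,h-1;\ h = 0,\dots,T, \\
&= F_t\bar{\theta}^h_t + \alpha_t\phi_t\bar{Z}^h_t , && t = 0,\dots,h-1;\ h = 0,\dots,T.
\end{aligned}
\tag{34}
$$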


To find an efficient backward view, again we can start by writing down explicitly the
value of $\bar{\theta}^h_t$. Following a derivation identical to the one for when we first considered trusted
weights, as shown in (16) but with the complex target $\bar{Z}^h_t$ of (32) replacing the earlier
combined target, we obtain

$$\bar{\theta}^h_t = a_{t-1} + \sum_{k=0}^{t-1} F_{t-1}\cdots F_{k+1}\alpha_k\phi_k\bar{Z}^h_k . \tag{35}$$


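As before, we can apply this to both $\bar{\theta}^{h+1}_h$ and $\bar{\theta}^h_h$ to find the difference

$$
\begin{aligned}
\bar{\theta}^{h+1}_h - \bar{\theta}^h_h &= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\bar{Z}^{h+1}_k - \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\bar{Z}^{h}_k && (a_{h-1} \text{ cancels}) \\
&= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\big(\bar{Z}^{h+1}_k - \bar{Z}^{h}_k\big) \\
&= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\big(\beta_{h+1}Z^{h+1}_k + (1-\beta_{h+1})\bar{Z}^{h}_k - \bar{Z}^{h}_k\big) && \text{(from (32))} \\
&= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\,\beta_{h+1}\big(Z^{h+1}_k - \bar{Z}^{h}_k\big) \\
&= \beta_{h+1}\sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_kZ^{h+1}_k - \beta_{h+1}\underbrace{\sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\bar{Z}^{h}_k}_{=\,\bar{\theta}^h_h - a_{h-1},\ \text{from (35)}} \\
&= \beta_{h+1}\Big(a_{h-1} + \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_kZ^{h+1}_k\Big) - \beta_{h+1}\bar{\theta}^h_h .
\end{aligned}
\tag{36}
$$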


The first part of this result, within the brackets, looks familiar: notice the similarity to
(35). The term can be interpreted as the intermediate result of a forward view with
targets $Z^h_t$, as defined in (30), and updates defined by

$$
\begin{aligned}
\theta^t_0 &= \theta_0 , && t = 0,\dots,T, \\
\theta^h_{t+1} &= \theta^h_t + \alpha_t\phi_t\big(Z^h_t - \phi_t^\top\theta^h_t\big) , && t = 0,\dots,h-1;\ h = 0,\dots,T.
\end{aligned}
\tag{37}
$$


This is an online algorithm, comparable to the algorithm for $\lambda$-returns derived before,
but implicitly including cumulative returns and discounting through the definition of $Z^h_t$.
The complete forward view (34) can then be interpreted as switching smoothly between
this online algorithm and an offline algorithm that in the extreme can delay updating
the predictions indefinitely. Analogous to the interim weights shown in (35), the interim
weights of this online algorithm satisfy

$$\theta^h_t = a_{t-1} + \sum_{k=0}^{t-1} F_{t-1}\cdots F_{k+1}\alpha_k\phi_kZ^h_k , \qquad t = 0,\dots,h;\ h = 1,\dots,T. \tag{38}$$


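We can then continue our derivation from (36) with

$$
\begin{aligned}
\bar{\theta}^{h+1}_h - \bar{\theta}^h_h &= \beta_{h+1}\underbrace{\Big(a_{h-1} + \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_kZ^{h+1}_k\Big)}_{=\,\theta^{h+1}_h,\ \text{using (38)}} - \beta_{h+1}\bar{\theta}^h_h \\
&= \beta_{h+1}\big(\theta^{h+1}_h - \bar{\theta}^h_h\big) .
\end{aligned}
$$

This implies that if we have $\bar{\theta}^h_h$ and $\theta^{h+1}_h$ we can then compute $\bar{\theta}^{h+1}_h$ efficiently using

$$\bar{\theta}^{h+1}_h = (1-\beta_{h+1})\,\bar{\theta}^h_h + \beta_{h+1}\,\theta^{h+1}_h . \tag{39}$$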

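For now, we ignore the question of how to obtain $\theta^{h+1}_h$ and first focus on how to go from
$\bar{\theta}^{h+1}_h$ to $\bar{\theta}^{h+1}_{h+1}$ and, resultingly, from $\bar{\theta}^h_h$ to $\bar{\theta}^{h+1}_{h+1}$. We can now derive

$$
\begin{aligned}
\bar{\theta}^{h+1}_{h+1} &= F_h\bar{\theta}^{h+1}_h + \alpha_h\phi_h\bar{Z}^{h+1}_h && \text{(using (34))} \\
&= F_h\big(\beta_{h+1}\theta^{h+1}_h + (1-\beta_{h+1})\bar{\theta}^h_h\big) + \alpha_h\phi_h\bar{Z}^{h+1}_h && \text{(using (39))} \\
&= F_h\big(\beta_{h+1}\theta^{h+1}_h + (1-\beta_{h+1})\bar{\theta}^h_h\big) + \alpha_h\phi_h\big[\beta_{h+1}Z^{h+1}_h + (1-\beta_{h+1})\bar{Z}^{h}_h\big] && \text{(using (32))} \\
&= F_h\big(\beta_{h+1}\theta^{h+1}_h + (1-\beta_{h+1})\bar{\theta}^h_h\big) + \alpha_h\phi_h\big[\beta_{h+1}Z^{h+1}_h + (1-\beta_{h+1})\phi_h^\top\bar{\theta}^h_h\big] && \text{(using (33))} \\
&= \beta_{h+1}\underbrace{\big(F_h\theta^{h+1}_h + \alpha_h\phi_hZ^{h+1}_h\big)}_{=\,\theta^{h+1}_{h+1}} + (1-\beta_{h+1})\underbrace{\big(F_h\bar{\theta}^h_h + \alpha_h\phi_h\phi_h^\top\bar{\theta}^h_h\big)}_{=\,\bar{\theta}^h_h} && \text{(regrouping)} \\
&= \beta_{h+1}\theta^{h+1}_{h+1} + (1-\beta_{h+1})\bar{\theta}^h_h .
\end{aligned}
$$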


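We see that $\theta^{h+1}_h$ is not needed and instead we can use $\theta^{h+1}_{h+1}$. Therefore, if $\theta^{h+1}_{h+1}$ can be
computed with constant computation independent of span, then $\bar{\theta}^{h+1}_{h+1}$ can be computed
from $\bar{\theta}^h_h$ efficiently as well, using

$$\bar{\theta}^{h+1}_{h+1} = (1-\beta_{h+1})\,\bar{\theta}^h_h + \beta_{h+1}\,\theta^{h+1}_{h+1} . \tag{40}$$

It now remains to be shown that the online weights $\theta^{h+1}_{h+1}$ can be computed efficiently.
We start with the difference between $\theta^{h+1}_h$ and $\theta^h_h$ for which we can use (38) to obtain

$$\theta^{h+1}_h - \theta^h_h = \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\big(Z^{h+1}_k - Z^h_k\big) . \tag{41}$$

We can find the difference $Z^{h+1}_k - Z^h_k$ from the definition in (29). All terms not involving
$P_h$, $X_{h+1}$ or $P_{h+1}$ cancel, leaving us with

$$
\begin{aligned}
Z^{h+1}_k - Z^h_k ={}& \underbrace{\gamma_{k+1}\cdots\gamma_h\,\lambda_{k+1}\cdots\lambda_{h-1}\big((1-\lambda_h)P_h + \lambda_hX_{h+1} + \lambda_h\gamma_{h+1}P_{h+1}\big)}_{\text{from } Z^{h+1}_k} \\
&- \underbrace{\gamma_{k+1}\cdots\gamma_h\,\lambda_{k+1}\cdots\lambda_{h-1}P_h}_{\text{from } Z^h_k} \\
={}& \gamma_{k+1}\cdots\gamma_h\,\lambda_{k+1}\cdots\lambda_{h-1}\big(-\lambda_hP_h + \lambda_hX_{h+1} + \lambda_h\gamma_{h+1}P_{h+1}\big) \\
={}& \gamma_{k+1}\cdots\gamma_h\,\lambda_{k+1}\cdots\lambda_{h}\big(X_{h+1} + \gamma_{h+1}P_{h+1} - P_h\big) .
\end{aligned}
$$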


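Here we see the emergence of a general form of the classical temporal-difference error for
cumulative discounted returns: $\delta_h = X_{h+1} + \gamma_{h+1}P_{h+1} - P_h$.¹ Using this, we can now
continue from (41) with

$$
\begin{aligned}
\theta^{h+1}_h - \theta^h_h &= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\big(Z^{h+1}_k - Z^h_k\big) \\
&= \sum_{k=0}^{h-1} F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\,\gamma_{k+1}\cdots\gamma_h\,\lambda_{k+1}\cdots\lambda_h\,\delta_h \\
&= \gamma_h\lambda_h\underbrace{\Big(\sum_{k=0}^{h-1} \gamma_{k+1}\cdots\gamma_{h-1}\,\lambda_{k+1}\cdots\lambda_{h-1}F_{h-1}\cdots F_{k+1}\alpha_k\phi_k\Big)}_{=\,e_{h-1}}\delta_h \\
&= \gamma_h\lambda_he_{h-1}\delta_h ,
\end{aligned}
\tag{42}
$$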


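where, similar to before, the trace vector $e_t$ can be updated recursively according to

$$
\begin{aligned}
e_t &= \sum_{k=0}^{t} \gamma_{k+1}\cdots\gamma_t\,\lambda_{k+1}\cdots\lambda_tF_t\cdots F_{k+1}\alpha_k\phi_k \\
&= \sum_{k=0}^{t-1} \gamma_{k+1}\cdots\gamma_t\,\lambda_{k+1}\cdots\lambda_tF_t\cdots F_{k+1}\alpha_k\phi_k + \alpha_t\phi_t \\
&= \gamma_t\lambda_tF_t\sum_{k=0}^{t-1} \gamma_{k+1}\cdots\gamma_{t-1}\,\lambda_{k+1}\cdots\lambda_{t-1}F_{t-1}\cdots F_{k+1}\alpha_k\phi_k + \alpha_t\phi_t \\
&= \gamma_t\lambda_tF_te_{t-1} + \alpha_t\phi_t \qquad\qquad (43) \\
&= \gamma_t\lambda_te_{t-1} + \alpha_t\phi_t\big(1 - \gamma_t\lambda_t\phi_t^\top e_{t-1}\big) .
\end{aligned}
$$

¹ If we use our current predictions as interim targets, the TD error is $\delta_t = X_{t+1} + \gamma_{t+1}\phi_{t+1}^\top\theta_t - \phi_t^\top\theta_{t-1}$.
This TD error uses the weights at two consecutive time steps and is therefore slightly different from
the classic TD error defined, using only the current weights $\theta_t$, as $\delta_t = X_{t+1} + \gamma_{t+1}\phi_{t+1}^\top\theta_t - \phi_t^\top\theta_t$.
The difference is important to achieve exact equivalence, although it is also possible to rewrite the new
algorithms to use a more standard TD error.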


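The trace now decays according to both $\lambda$ and $\gamma$. Using this trace vector, we can compute
$\theta^{h+1}_h$ efficiently from $\theta^h_h$ using (42). We now combine this with the final step to $\theta^{h+1}_{h+1}$ from
$\theta^{h+1}_h$ and derive

$$
\begin{aligned}
\theta^{h+1}_{h+1} &= F_h\theta^{h+1}_h + \alpha_h\phi_hZ^{h+1}_h && \text{(using (37))} \\
&= F_h\theta^{h+1}_h + \alpha_h\phi_h\big(X_{h+1} + \gamma_{h+1}P_{h+1}\big) && \text{(using (30), (31))} \\
&= F_h\big(\theta^h_h + \gamma_h\lambda_he_{h-1}(X_{h+1} + \gamma_{h+1}P_{h+1} - P_h)\big) + \alpha_h\phi_h\big(X_{h+1} + \gamma_{h+1}P_{h+1}\big) && \text{(using (42))} \\
&= F_h\theta^h_h + \gamma_h\lambda_hF_he_{h-1}\big(X_{h+1} + \gamma_{h+1}P_{h+1} - P_h\big) + \alpha_h\phi_h\big(X_{h+1} + \gamma_{h+1}P_{h+1}\big) \\
&= F_h\theta^h_h + (e_h - \alpha_h\phi_h)\big(X_{h+1} + \gamma_{h+1}P_{h+1} - P_h\big) + \alpha_h\phi_h\big(X_{h+1} + \gamma_{h+1}P_{h+1}\big) && \text{(using } \gamma_h\lambda_hF_he_{h-1} = e_h - \alpha_h\phi_h \text{, from (43))} \\
&= F_h\theta^h_h + e_h\big(X_{h+1} + \gamma_{h+1}P_{h+1} - P_h\big) + \alpha_h\phi_hP_h \\
&= (I - \alpha_h\phi_h\phi_h^\top)\theta^h_h + e_h\big(X_{h+1} + \gamma_{h+1}P_{h+1} - P_h\big) + \alpha_h\phi_hP_h \\
&= \theta^h_h + e_h\big(X_{h+1} + \gamma_{h+1}P_{h+1} - P_h\big) + \alpha_h\phi_h\big(P_h - \phi_h^\top\theta^h_h\big) \\
&= \theta^h_h + e_h\delta_h + \alpha_h\phi_h\big(P_h - \phi_h^\top\theta^h_h\big) .
\end{aligned}
$$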

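All weights again have matching sub- and superscripts and an equivalent TD algorithm,
in the sense that $\theta_t = \theta^t_t$ and $\bar{\theta}_t = \bar{\theta}^t_t$, for the fully general case including cumulative
discounted returns is given by

$$
\begin{aligned}
e_{-1} = 0, \text{ then } e_t &= \gamma_t\lambda_te_{t-1} + \alpha_t\phi_t\big(1 - \gamma_t\lambda_t\phi_t^\top e_{t-1}\big), && t = 0,\dots,T-1, \\
\theta_{t+1} &= \theta_t + e_t\delta_t + \alpha_t\phi_t\big(P_t - \phi_t^\top\theta_t\big), && t = 0,\dots,T-1, \\
\bar{\theta}_{t+1} &= \bar{\theta}_t + \beta_{t+1}\big(\theta_{t+1} - \bar{\theta}_t\big), && t = 0,\dots,T-1.
\end{aligned}
\tag{44}
$$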


The first two lines constitute the online algorithm we just derived. The last line is from
(40) and extends this algorithm to include smooth switching between offline and online
updates. If $\beta_t = 1$ for all $t$, the algorithm reduces to a variant of TD($\lambda$) known as true
online TD($\lambda$) (van Seijen and Sutton 2014), but extended to include general, potentially
non-constant, sequences of $\{\alpha_t\}$, $\{\gamma_t\}$ and $\{\lambda_t\}$. The extension to averaging according to
$\beta_t$ is new to this paper.
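As an illustration, the three updates of (44) can be written as a short NumPy loop and checked numerically against a brute-force forward view. This is only a sketch under stated assumptions: the function names `span_independent_td` and `forward_view` are ours, the residual predictions $P_t$ are supplied as a fixed array, the trusted weights are initialised to $\theta_0$, and the forward view is written from the recursive interim targets $Z_t^t = P_t$, $Z_t^h = X_{t+1} + \gamma_{t+1}(1 - \lambda_{t+1}) P_{t+1} + \gamma_{t+1} \lambda_{t+1} Z_{t+1}^h$.

```python
import numpy as np

def span_independent_td(phi, X, gamma, lam, P, alpha, beta, theta0):
    """One pass of the updates in (44); O(n) work per step.

    phi[t] and alpha[t] are used for t = 0..T-1; X, gamma, lam, P and beta
    are indexed 0..T (X[0] and beta[0] are never read).
    """
    T, n = phi.shape
    e = np.zeros(n)                          # e_{-1} = 0
    theta = np.array(theta0, dtype=float)    # online weights
    theta_bar = theta.copy()                 # trusted weights (assumed init)
    for t in range(T):
        gl = gamma[t] * lam[t]
        # trace update, first line of (44)
        e = gl * e + alpha[t] * phi[t] * (1.0 - gl * (phi[t] @ e))
        delta = X[t + 1] + gamma[t + 1] * P[t + 1] - P[t]
        # online weights, second line of (44)
        theta = theta + e * delta + alpha[t] * phi[t] * (P[t] - phi[t] @ theta)
        # trusted weights, third line of (44)
        theta_bar = theta_bar + beta[t + 1] * (theta - theta_bar)
    return theta, theta_bar

def forward_view(phi, X, gamma, lam, P, alpha, theta0, h):
    """Brute-force reference: recompute all targets Z_k^h, then replay updates."""
    Z = np.empty(h + 1)
    Z[h] = P[h]
    for k in range(h - 1, -1, -1):
        Z[k] = X[k + 1] + gamma[k + 1] * (
            (1.0 - lam[k + 1]) * P[k + 1] + lam[k + 1] * Z[k + 1])
    theta = np.array(theta0, dtype=float)
    for k in range(h):
        theta = theta + alpha[k] * phi[k] * (Z[k] - phi[k] @ theta)
    return theta

rng = np.random.default_rng(0)
T, n = 8, 3
phi = rng.normal(size=(T, n))
X = rng.normal(size=T + 1)
gamma = rng.uniform(0.5, 1.0, size=T + 1)
lam = rng.uniform(0.0, 1.0, size=T + 1)
P = rng.normal(size=T + 1)
alpha = rng.uniform(0.05, 0.2, size=T)
beta = np.ones(T + 1)                        # beta_t = 1: trusted = online
theta0 = np.zeros(n)

theta_T, theta_bar_T = span_independent_td(phi, X, gamma, lam, P, alpha, beta, theta0)
theta_fwd = forward_view(phi, X, gamma, lam, P, alpha, theta0, h=T)
print(np.allclose(theta_T, theta_fwd))
```

The forward view recomputes every target for each horizon, so its cost grows with the span, whereas the loop in `span_independent_td` touches each weight a constant number of times per step.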

Soft termination generalizes the episodic setting we considered previously. This means
that, as far as the learning update is concerned, we do not have to treat steps on which
the process actually terminates and restarts for a new episode in any special way. To
see how this works, we first renumber the time steps on consecutive episodes: if the first
episode ends at time $T$, the initial time step of the second episode will be taken to be
$T$ rather than 0. If the second episode lasts $T'$ steps, the third episode is then taken to
begin on $T + T'$, and so on. Together with the requirement that $\gamma_T = 0$ on every actual
termination, this is sufficient to get updates that are completely equivalent to treating the
subsequent episodes completely separately. Notice that the update on termination at some
time $T$, and resulting in $\theta_T$, uses the residual prediction on termination $P_T$ only when
multiplied with $\gamma_T$. Previously we required that $P_T = 0$ because there is no further signal
to predict. This is now no longer necessary, because $\gamma_T = 0$ already fulfills the requirement
that $\gamma_T P_T = 0$. If for instance we use our current predictions, such that $P_{t+1} = \phi_{t+1}^\top \theta_t$
for all $t$, we can simply keep the update as is even though $P_T$ will then be the prediction
for the cumulative return of the next episode, because $\phi_T$ is now defined to be its first
feature vector. Therefore, both hard and soft termination can be handled seamlessly using
algorithm (44), and that will be our final, general algorithm.
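The equivalence between soft and hard termination can be checked numerically. The sketch below is an illustration under assumptions of our own: a constant step size, externally supplied residual predictions, and a hypothetical helper `online_td` that applies the first two updates of (44) with a fresh trace. It runs two concatenated episodes once with $\gamma_T = 0$ at the boundary, and once as two separate episodes in which the first uses $P_T = 0$ and the second restarts its trace.

```python
import numpy as np

def online_td(phi, X, gamma, lam, P, alpha, theta, start, stop):
    """First two updates of (44) for steps start..stop-1, trace reset at start."""
    e = np.zeros_like(theta)
    for t in range(start, stop):
        gl = gamma[t] * lam[t]
        e = gl * e + alpha * phi[t] * (1.0 - gl * (phi[t] @ e))
        delta = X[t + 1] + gamma[t + 1] * P[t + 1] - P[t]
        theta = theta + e * delta + alpha * phi[t] * (P[t] - phi[t] @ theta)
    return theta

rng = np.random.default_rng(1)
T, T2, n = 5, 4, 3                  # first episode: T steps, second: T2 steps
N = T + T2
phi = rng.normal(size=(N, n))
X = rng.normal(size=N + 1)
gamma = rng.uniform(0.5, 1.0, size=N + 1)
gamma[T] = 0.0                      # soft termination at the episode boundary
lam = rng.uniform(0.0, 1.0, size=N + 1)
P = rng.normal(size=N + 1)          # P[T] is nonzero: prediction for episode 2
alpha = 0.1
theta0 = np.zeros(n)

# (a) one continuous run; gamma[T] = 0 cuts both the return and the trace
theta_joint = online_td(phi, X, gamma, lam, P, alpha, theta0, 0, N)

# (b) hard termination: episode 1 with P_T = 0, then episode 2 from a fresh trace
P_hard = P.copy()
P_hard[T] = 0.0
theta_mid = online_td(phi, X, gamma, lam, P_hard, alpha, theta0, 0, T)
theta_split = online_td(phi, X, gamma, lam, P, alpha, theta_mid, T, N)

print(np.allclose(theta_joint, theta_split))
```

In run (a) the trace at step $T$ becomes $e_T = \alpha_T \phi_T$, exactly what a restart from $e_{-1} = 0$ would give, which is why no special case is needed in the update.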


9 Convergence Analysis 


The algorithm (44) differs from related earlier algorithms such as TD($\lambda$) in a few subtle
but important ways. The most notable differences are the updates to the traces and the
averaging due to $\beta_t$. Known results on convergence therefore do not automatically transfer
to this new algorithm and it is appropriate to take a moment to analyze it.

The convergence of the trusted weights $\bar\theta$ depends on the convergence of the online
weights $\theta$ and so we must investigate these jointly. The online weights, in turn, depend on
the sequences of parameters and residual predictions that are supplied. We want our analysis
to be general, which means we want to be able to handle general sequences of discounts
$\{\gamma_t\}$, persistency parameters $\{\lambda_t\}$, and residual predictions $\{P_t\}$. Naturally, if
any of these can change completely arbitrarily, we can have no hope of converging to any
predeterminable solution. Therefore, as in Sutton et al. (2014), we allow the features,
discounts and persistency parameters to be stationary functions of an underlying unobserved
state, such that $\phi_t = \phi(S_t)$, $\gamma_t = \gamma(S_t)$ and $\lambda_t = \lambda(S_t)$ for some fixed functions
$\phi : \mathcal{S} \to \mathbb{R}^n$, $\gamma : \mathcal{S} \to [0,1]$ and $\lambda : \mathcal{S} \to [0,1]$, where $\mathcal{S}$ is a state space and $S_t \in \mathcal{S}$ is the
(unobserved) state of the world at time $t$. We assume there is a steady-state distribution
over these states such that all expectations used below are well-defined with respect to a
distribution over states defined by $\lim_{t \to \infty} \Pr(S_t = s)$. This setting generalizes the more
standard approach where $\gamma_t = \gamma$ and $\lambda_t = \lambda$ are constants, because now these parameters
can still change over time, but it avoids the possibility of arbitrary non-stationarity that
would ruin convergence.

We first consider convergence when the residual predictions are also due to a fixed 
function of state, for instance because they are due to otherwise stationary experts or 
oracles. 

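Theorem 1. Let $X_t = X(S_t)$, $\phi_t = \phi(S_t)$, $\gamma_t = \gamma(S_t)$, $\lambda_t = \lambda(S_t)$ and $P_t = P(S_t)$ all
be fixed functions of (unobserved) states $S_t \in \mathcal{S}$, with a stable steady-state distribution $d$.
Then, if $\sum_{t=0}^{\infty} \alpha_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$ and $\sum_{t=0}^{\infty} \beta_t = \infty$, algorithm (44) converges almost
surely to the fixed-point solution

$$\theta_* = E\left[ \phi_t \phi_t^\top \right]^{-1} E\left[ \phi_t Z_t^\infty \right] .$$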

Proof. We start by analyzing the online weights $\theta_t$. Because of the equivalence of the
forward and backward views, we can investigate the forward view, which is easier to
analyze. In other words, instead of investigating $\lim_{t \to \infty} \theta_t$ as updated through (44), we
investigate $\lim_{t \to \infty} \theta_t^t$ updated through (37). By construction, the end result is exactly
the same. The asymptotic forward view as the horizon goes to infinity is

\begin{align*}
\theta_{t+1}^\infty &= \theta_t^\infty + \alpha_t \phi_t \left( Z_t^\infty - \phi_t^\top \theta_t^\infty \right) , && t = 0, 1, \ldots , \\
\text{where } Z_t^\infty &= \lim_{h \to \infty} Z_t^h , && t = 0, 1, \ldots .
\end{align*}

Because $Z_t^\infty$ does not depend on the weights, this is a standard stochastic gradient-descent
update $\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta l(\theta) |_{\theta_t}$ on the quadratic loss function

$$l(\theta) = \tfrac{1}{2} E\left[ \left( Z_t^\infty - \phi_t^\top \theta \right)^2 \right] .$$

If the step sizes are suitably chosen, for instance such that $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$
(Robbins and Monro 1951), and if the means and variances of $Z_t^\infty$ and $\phi_t$ are well-defined
and bounded for all $t$, this update converges to the fixed-point solution $\theta_*$ that minimizes
the quadratic loss (cf. Kushner and Yin 2003), such that

$$\lim_{t \to \infty} \theta_t^\infty = \theta_* = E\left[ \phi_t \phi_t^\top \right]^{-1} E\left[ \phi_t Z_t^\infty \right] .$$

It is straightforward to see that $\bar\theta_t$ will have the same limit; it suffices to have $\sum_{t=0}^{\infty} \beta_t = \infty$. □
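For completeness, the closed form of the fixed point follows from setting the gradient of the quadratic loss to zero. The short derivation below is a sketch that takes the loss with a factor $\tfrac{1}{2}$, so that the sampled update matches the forward view without rescaling the step size, and assumes that $E[\phi_t \phi_t^\top]$ is invertible:

```latex
\begin{align*}
l(\theta) &= \tfrac{1}{2} E\left[ \left( Z_t^\infty - \phi_t^\top \theta \right)^2 \right] , \\
\nabla_\theta l(\theta) &= - E\left[ \phi_t \left( Z_t^\infty - \phi_t^\top \theta \right) \right] , \\
\nabla_\theta l(\theta_*) = 0
\;&\Longleftrightarrow\;
E\left[ \phi_t \phi_t^\top \right] \theta_* = E\left[ \phi_t Z_t^\infty \right]
\;\Longleftrightarrow\;
\theta_* = E\left[ \phi_t \phi_t^\top \right]^{-1} E\left[ \phi_t Z_t^\infty \right] .
\end{align*}
```

A single sample of $-\nabla_\theta l(\theta)$ is $\phi_t ( Z_t^\infty - \phi_t^\top \theta )$, which is the direction used by the asymptotic forward-view update.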

Although convergence is already guaranteed when $\beta_t = 1$ for all $t$, recent work has
shown that for similar stochastic gradient algorithms the optimal rate of convergence is
attained if $\beta_t$ decreases much faster than $\alpha_t$, specifically when $\beta_t = O(t^{-1})$ while $\alpha_t = \alpha$
for some constant $\alpha$ (Bach and Moulines 2013). More generally, it seems likely that
convergence also holds if $\sum_{t=0}^{\infty} \beta_t = \infty$ and $\sum_{t=0}^{\infty} \beta_t^2 < \infty$. The observation that $\beta_t$ should
perhaps decrease over time for faster learning may seem at odds with our introduction of
this parameter as a degree of trust. However, these two views are quite compatible if we
consider $\beta_t$ to be the degree of trust we place in the online updates relative to the trust
we place in our current predictions due to the trusted weights. When the trust in the
predictions increases over time, the relative trust in the inherently noisy online targets
should then decrease.
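One concrete instance of a decreasing $\beta_t$ is $\beta_t = 1/t$, for which the trusted-weight update in (44) reduces to a running average of the online weights. The sketch below illustrates this with random stand-ins for $\theta_1, \ldots, \theta_{50}$; the choice $\beta_t = 1/t$ and the synthetic data are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
online = rng.normal(size=(50, 4))    # stand-ins for theta_1, ..., theta_50

theta_bar = np.zeros(4)              # arbitrary start: beta_1 = 1 overwrites it
for t, theta in enumerate(online, start=1):
    beta_t = 1.0 / t
    # trusted-weight update from (44): move a fraction beta_t towards theta_t
    theta_bar = theta_bar + beta_t * (theta - theta_bar)

running_mean = online.mean(axis=0)
print(np.allclose(theta_bar, running_mean))
```

With this schedule every online weight vector receives equal weight in the trusted weights, so the relative influence of each new, noisy online update shrinks as more of them accumulate.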

Although Theorem 1 is already fairly general, it does not cover the important case
when the residual predictions additionally depend on the weights we are updating. It
makes sense to use the predictions we trust most and therefore we now consider what
happens when $P_t = \phi_t^\top \bar\theta_{t-1}$. Notice that we have to use $\bar\theta_{t-1}$ rather than $\bar\theta_t$, because $P_t$ is
used in the computation of $\bar\theta_t$ and so the latter is not yet available when we compute $P_t$.
The analysis of this case is more complex than the previous one, because $P_t$ is no longer a
constant function of state. This means the update is no longer a standard gradient-descent
update on a quadratic loss, because the target $Z_t^\infty$ for the online weights itself depends on
the trusted weights that we are simultaneously updating.


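The results by Bach and Moulines (2013) on stochastic gradient descent indicate
that perhaps the most interesting case is where $\beta_t$ decreases faster than $\alpha_t$, such that
$\lim_{t \to \infty} \beta_t / \alpha_t = 0$. This suggests an analysis on two time scales is appropriate.

Theorem 2. Let $X_t = X(S_t)$, $\phi_t = \phi(S_t)$, $\gamma_t = \gamma(S_t)$ and $\lambda_t = \lambda(S_t)$ all be fixed bounded
functions of (unobserved) states $S_t \in \mathcal{S}$, with a stable steady-state distribution $d$. Define
$P_t = \phi_t^\top \bar\theta_{t-1}$. Then, if $\sum_{t=0}^{\infty} \alpha_t = \sum_{t=0}^{\infty} \beta_t = \infty$, $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$, $\sum_{t=0}^{\infty} \beta_t^2 < \infty$ and $\lim_{t \to \infty} \beta_t / \alpha_t = 0$,
algorithm (44) converges almost surely to the TD fixed-point solution $\theta_*$ that minimizes
the mean-squared projected Bellman error (Sutton, Szepesvári, and Maei 2008; Sutton et
al. 2009), such that

$$E\left[ \left( Z_t(\theta_*) - \phi_t^\top \theta_* \right) \phi_t^\top \right] E\left[ \phi_t \phi_t^\top \right]^{-1} E\left[ \phi_t \left( Z_t(\theta_*) - \phi_t^\top \theta_* \right) \right] = 0 , \tag{45}$$

where

$$Z_t(\theta) = X_{t+1} + \gamma_{t+1} (1 - \lambda_{t+1}) \phi_{t+1}^\top \theta + \gamma_{t+1} \lambda_{t+1} Z_{t+1}(\theta) , \qquad \forall \theta .$$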


Proof. In two-time-scale analyses, we are allowed to analyze the faster updates as if the
slower updates have stopped. This means that in analyzing the updates to the online
weights $\theta_t$, we can assume the trusted weights $\bar\theta_t$ are constant to analyze where $\theta_t$ converges
towards as a function of the stationary $\bar\theta_t$. On the other hand, when we analyze the slower
updates to the trusted weights $\bar\theta_t$ we are allowed to assume the faster updates to the
online weights converge completely between each two steps. For more detail on analyzing
stochastic approximations on two time scales, we refer to Borkar (1997), Borkar (2008),
Kushner and Yin (2003), and Konda and Tsitsiklis (2004).


We first analyze the convergence of the faster updates to the online weights, where
we can assume that the trusted weights are stationary at some value $\bar\theta$. Then, using
$P_t = \phi_t^\top \bar\theta$, the targets for the updates of the forward view are

\begin{align*}
Z_t^t(\bar\theta) &= \phi_t^\top \bar\theta , \\
Z_t^h(\bar\theta) &= X_{t+1} + \gamma_{t+1} (1 - \lambda_{t+1}) \phi_{t+1}^\top \bar\theta + \gamma_{t+1} \lambda_{t+1} Z_{t+1}^h(\bar\theta) ,
\end{align*}

where we have extended the notation slightly to make the dependence of $Z_t^h(\bar\theta)$ on $\bar\theta$
explicit. Notice that the residual predictions on each time step depend on the same
stationary trusted weights $\bar\theta$. Because of the assumed stationarity of $\bar\theta$, the updates to
the weights $\theta_t^h$ of the forward view can again be considered standard stochastic gradient
updates and therefore these weights converge towards the fixed point $\theta_*(\bar\theta)$, where again
we make the dependence on $\bar\theta$ explicit, defined by

$$\theta_*(\bar\theta) = E\left[ \phi_t \phi_t^\top \right]^{-1} E\left[ \phi_t Z_t^\infty(\bar\theta) \right] ,$$

where $Z_t^\infty(\bar\theta) = \lim_{h \to \infty} Z_t^h(\bar\theta)$ denotes the limit of the target of the update as the horizon
grows to infinity. In an episodic setting, $Z_t^\infty = Z_t^T$ for all $t < T$, where $T$ denotes the first
termination after time $t$. More generally, $Z_t^\infty$ is always well-defined because we require
$\prod_{t=0}^{\infty} \gamma_t = 0$.

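For the analysis of the slower updates to $\bar\theta_t$, we can now assume the faster time scale
has already converged to its fixed point $\theta_*(\bar\theta_t)$ for the current weights. Therefore, we
analyze the update

\begin{align*}
\bar\theta_{t+1} &= \bar\theta_t + \beta_{t+1} \left( \theta_*(\bar\theta_t) - \bar\theta_t \right) \\
&= \bar\theta_t + \beta_{t+1} \left( E\left[ \phi_t \phi_t^\top \right]^{-1} E\left[ \phi_t Z_t^\infty(\bar\theta_t) \right] - \bar\theta_t \right) .
\end{align*}

This is a stochastic-approximation update that, under the conditions that $\sum_{t=0}^{\infty} \beta_t = \infty$
and $\sum_{t=0}^{\infty} \beta_t^2 < \infty$, converges almost surely to the fixed point $\theta_*$ that satisfies

$$\theta_* = E\left[ \phi_t \phi_t^\top \right]^{-1} E\left[ \phi_t Z_t^\infty(\theta_*) \right] .$$

If we multiply both sides with $E[\phi_t \phi_t^\top]$, this implies that $E[\phi_t \phi_t^\top \theta_*] = E[\phi_t Z_t^\infty(\theta_*)]$ and
therefore, by moving both terms to the same side and then multiplying with
$E[(Z_t^\infty(\theta_*) - \phi_t^\top \theta_*) \phi_t^\top] E[\phi_t \phi_t^\top]^{-1}$,

$$E\left[ \left( Z_t^\infty(\theta_*) - \phi_t^\top \theta_* \right) \phi_t^\top \right] E\left[ \phi_t \phi_t^\top \right]^{-1} E\left[ \phi_t \left( Z_t^\infty(\theta_*) - \phi_t^\top \theta_* \right) \right] = 0 .$$

It follows immediately that $\theta_*$ minimizes the mean-squared projected Bellman error
completely to zero, as desired. □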
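To make the fixed point concrete, the sketch below computes it in closed form for a small finite Markov chain. Everything specific here is an assumption for illustration only: constant $\gamma$ and $\lambda$, a three-state transition matrix `Pm`, expected rewards `r` with $r(s) = E[X_{t+1} \mid S_t = s]$, and the helper names. Taking expectations of the recursion for $Z_t^\infty(\theta)$ gives the per-state expected target $z(\theta) = (I - \gamma \lambda P)^{-1} (r + \gamma (1 - \lambda) P \Phi \theta)$, and the fixed point solves $\Phi^\top D (z(\theta_*) - \Phi \theta_*) = 0$, where $D$ holds the steady-state distribution on its diagonal.

```python
import numpy as np

def expected_target(Phi, Pm, r, gamma, lam, theta):
    """Per-state expectation of Z_t^inf(theta) for constant gamma and lambda."""
    I = np.eye(Pm.shape[0])
    rhs = r + gamma * (1.0 - lam) * (Pm @ Phi @ theta)
    return np.linalg.solve(I - gamma * lam * Pm, rhs)

def td_fixed_point(Phi, Pm, r, d, gamma, lam):
    """Solve Phi^T D (z(theta) - Phi theta) = 0 for theta."""
    I = np.eye(Pm.shape[0])
    D = np.diag(d)
    M = np.linalg.inv(I - gamma * lam * Pm)
    A = Phi.T @ D @ (Phi - gamma * (1.0 - lam) * (M @ Pm @ Phi))
    b = Phi.T @ D @ M @ r
    return np.linalg.solve(A, b)

Pm = np.array([[0.5, 0.5, 0.0],
               [0.1, 0.6, 0.3],
               [0.2, 0.0, 0.8]])
r = np.array([1.0, 0.0, -1.0])
gamma, lam = 0.9, 0.5

# steady-state distribution by power iteration (the chain is ergodic)
d = np.full(3, 1.0 / 3.0)
for _ in range(2000):
    d = d @ Pm

# two features for three states: the fixed point is an approximation
Phi = np.array([[1.0, 0.0],
                [0.5, 0.5],
                [0.0, 1.0]])
theta_star = td_fixed_point(Phi, Pm, r, d, gamma, lam)
z_star = expected_target(Phi, Pm, r, gamma, lam, theta_star)
residual = Phi.T @ np.diag(d) @ (z_star - Phi @ theta_star)

# tabular features: the fixed point equals the true discounted values
theta_tab = td_fixed_point(np.eye(3), Pm, r, d, gamma, lam)
v_true = np.linalg.solve(np.eye(3) - gamma * Pm, r)
print(np.allclose(residual, 0.0, atol=1e-10), np.allclose(theta_tab, v_true))
```

The vector `residual` is the sample-free counterpart of $E[\phi_t (Z_t^\infty(\theta_*) - \phi_t^\top \theta_*)]$, the inner factor of (45), so a zero residual means the projected Bellman error vanishes at `theta_star`.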


10 Discussion 


In this paper, we have considered how to answer predictive questions with algorithms that
use constant computation per time step that is proportional to the number of learned
weights, and that is independent of the span of the prediction. We considered both final
and cumulative outcomes, under online and offline updating, with and without persistency
of the residual predictions we encounter during an episode, and with hard and soft
termination. In the end, we obtained a single general algorithm that can be used for all these
different predictive questions, which is shown in (44). This algorithm is guaranteed to be
convergent under typical, fairly mild, technical conditions.

Some extensions remain for future work. In particular, we have not considered how
different policies of behavior can influence our predictions, and as a result have not talked
about the problem of control in which the goal is to find the optimal policy for a given
(reward) signal. Our analysis already extends naturally to the prediction of action values,
from which control policies can be easily distilled. Then, using a form of policy iteration
(Bellman 1957; Howard 1960), we can repeatedly switch between estimating and improving
the policy to tackle the problem of optimal control. However, to properly and fully
include adaptable policies, we would in addition need to carefully consider the problem
of learning off-policy, about action-selection policies that differ from the one used to
generate the data (Sutton and Barto 1998). This is consistent with, but orthogonal to, the ideas
outlined in this paper, and such off-policy predictions (including those about the greedy
and, ultimately, optimal policy) are learnable through a proper use of rejection sampling,
as in Q-learning, or importance sampling (Precup, Sutton, and Singh 2000; Precup and
Sutton 2001; Maei 2011; Sutton et al. 2014; van Hasselt, Mahmood, and Sutton 2014;
Mahmood, van Hasselt, and Sutton 2014).


All algorithms considered in this paper are in a sense descended from a linear stochastic
gradient, or LMS, update. The main idea of span-independent computation is more general
and can be applied quite naturally to other settings, including for instance non-linear
functions such as deep neural networks (LeCun, Bengio, and Hinton 2015; Mnih et al. 2015)
or to quadratic-time linear-function algorithms as in LSTD (Bradtke and Barto 1996). Not
all updates may have fully equivalent span-independent counterparts, but even then it may
be more important to be independent of span than to be exactly equivalent.


References 


Bach, F. and Moulines, E. (2013). Non-strongly-convex smooth stochastic approximation
with convergence rate O(1/n). Advances in Neural Information Processing Systems 26,
pp. 773-781.

Bellman, R. (1957). Dynamic Programming. Princeton University Press. 

Borkar, V. S. (1997). Stochastic approximation with two time scales. Systems & Control 
Letters 29(5), pp. 291-294. 

Borkar, V. S. (2008). Stochastic approximation. Cambridge Books. 

Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal
difference learning. Machine Learning 22, pp. 33-57.

Howard, R. A. (1960). Dynamic programming and Markov processes. MIT Press. 

Konda, V. R. and Tsitsiklis, J. N. (2004). Convergence rate of linear two-time-scale
stochastic approximation. Annals of Applied Probability 14(2), pp. 796-819.

Kushner, H. J. and Yin, G. (2003). Stochastic approximation and recursive algorithms and 
applications. Vol. 35. Springer Science & Business Media. 

LeCun, Y., Bengio, Y., and Hinton, G. (May 2015). Deep learning. Nature 521(7553), 
pp. 436-444. 

Maei, H. R. (2011). Gradient temporal-difference learning algorithms. PhD thesis.
University of Alberta.

Mahmood, A. R., van Hasselt, H. P., and Sutton, R. S. (2014). Weighted importance 
sampling for off-policy learning with linear function approximation. Advances in Neural 
Information Processing Systems 27, pp. 3014-3022. 

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, 
A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, 
A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D.
(2015). Human-level control through deep reinforcement learning. Nature 518(7540), 
pp. 529-533. 

Montemerlo, M. and Thrun, S. (2003). Simultaneous localization and mapping with
unknown data association using FastSLAM. In: IEEE International Conference on Robotics
and Automation, 2003. Vol. 2. IEEE, pp. 1985-1991.

Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by 
averaging. SIAM Journal on Control and Optimization 30(4), pp. 838-855. 

Precup, D. and Sutton, R. S. (2001). Off-policy temporal-difference learning with function 
approximation. In: Proceedings of the eighteenth International Conference on Machine 
Learning. Morgan Kaufmann, pp. 417-424. 

Precup, D., Sutton, R. S., and Singh, S. P. (2000). Eligibility traces for off-policy policy 
evaluation. In: Proceedings of the Seventeenth International Conference on Machine 
Learning. Morgan Kaufmann, pp. 766-773. 

Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of 
Mathematical Statistics 22(3), pp. 400-407. 

Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. PhD thesis. 
University of Massachusetts. 

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine 
Learning 3, pp. 9-44. 


Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT 
press, Cambridge MA. 

Sutton, R. S., Szepesvári, Cs., and Maei, H. R. (2008). A convergent O(n) algorithm for
off-policy temporal-difference learning with linear function approximation. Advances
in Neural Information Processing Systems 21, pp. 1609-1616.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, Cs., and
Wiewiora, E. (2009). Fast gradient-descent methods for temporal-difference learning 
with linear function approximation. In: Proceedings of the 26th Annual International 
Conference on Machine Learning. ACM, pp. 993-1000. 

Sutton, R. S., Mahmood, A. R., Precup, D., and van Hasselt, H. P. (2014). A new Q(λ)
with interim forward view and Monte Carlo equivalence. JMLR W&CP 32(1), pp. 568-576.

Szepesvári, Cs. (2010). Algorithms for reinforcement learning. Synthesis Lectures on
Artificial Intelligence and Machine Learning 4(1), pp. 1-103.

van Hasselt, H. P., Mahmood, A. R., and Sutton, R. S. (2014). Off-policy TD(λ) with
a true online equivalence. In: Proceedings of the 30th Conference on Uncertainty in
Artificial Intelligence. AUAI Press.

van Seijen, H. and Sutton, R. S. (2014). True online TD(λ). JMLR W&CP 32(1), pp. 692-700.

